Award  Number:  W81XWH-04-1-0495 


AD 


TITLE:  Investigation  of  Three-Group  Classifiers  to  Fully  Automate  Detection  and 
Classification  of  Breast  Lesions  in  an  Intelligent  CAD  Mammography  Workstation 


PRINCIPAL  INVESTIGATOR:  Darrin  C.  Edwards,  Ph.D. 

Charles E. Metz, Ph.D.
Maryellen  L.  Giger,  Ph.D. 

CONTRACTING  ORGANIZATION:  The  University  of  Chicago 

Chicago,  IL  60637 


REPORT  DATE:  May  2007 


TYPE  OF  REPORT:  Final 


PREPARED  FOR:  U.S.  Army  Medical  Research  and  Materiel  Command 
Fort  Detrick,  Maryland  21702-5012 


DISTRIBUTION  STATEMENT:  Approved  for  Public  Release; 

Distribution  Unlimited 


The  views,  opinions  and/or  findings  contained  in  this  report  are  those  of  the  author(s)  and 
should  not  be  construed  as  an  official  Department  of  the  Army  position,  policy  or  decision 
unless  so  designated  by  other  documentation. 


REPORT  DOCUMENTATION  PAGE 


1. REPORT DATE: 01-05-2007
2. REPORT TYPE: Final
3. DATES COVERED: 1 May 2004 - 30 Apr 2007
4. TITLE AND SUBTITLE: Investigation of Three-Group Classifiers to Fully Automate Detection and Classification of Breast Lesions in an Intelligent CAD Mammography Workstation
5a. CONTRACT NUMBER:
5b. GRANT NUMBER: W81XWH-04-1-0495
5c. PROGRAM ELEMENT NUMBER:
5d. PROJECT NUMBER:
5e. TASK NUMBER:
5f. WORK UNIT NUMBER:
6. AUTHOR(S): Darrin C. Edwards, Ph.D., Charles E. Metz, Ph.D., Maryellen L. Giger, Ph.D.
   Email: d-edwards@uchicago.edu
7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES): The University of Chicago, Chicago, IL 60637
8. PERFORMING ORGANIZATION REPORT NUMBER:
9. SPONSORING / MONITORING AGENCY NAME(S) AND ADDRESS(ES): U.S. Army Medical Research and Materiel Command, Fort Detrick, Maryland 21702-5012
10. SPONSOR/MONITOR’S ACRONYM(S):
11. SPONSOR/MONITOR’S REPORT NUMBER(S):
12. DISTRIBUTION / AVAILABILITY STATEMENT: Approved for Public Release; Distribution Unlimited


13.  SUPPLEMENTARY  NOTES 


14.  ABSTRACT 

Our  goal  is  to  develop  a  fully  automated  classification  scheme  for  computer-aided  diagnosis  in  mammography.  Our  proposed 
scheme  would  classify  computer  detections  into  three  groups:  malignant  lesions,  benign  lesions,  and  false-positive  computer 
detections.  We  proved  that  the  area  under  the  ROC  curve  (AUC)  is  not  useful  in  classification  tasks  with  three  or  more  groups, 
and  showed  that  the  three  decision  boundary  lines  used  by  the  three-group  ideal  observer  are  intricately  related  to  one 
another.  We  analyzed  several  recently  proposed  three-group  classification  methods  in  terms  of  the  ideal  observer.  We 
collected  a  database  of  270  mammographic  images  with  clustered  microcalcification  lesions.  We  have  developed  a  novel 
performance  metric  that  may  generalize  better  than  AUC  to  tasks  with  more  than  two  groups.  A  three-group  classifier  could 
potentially  allow  radiologists  to  detect  more  malignant  breast  lesions  without  increasing  their  false-positive  biopsy  rates. 


15.  SUBJECT  TERMS 

Computer-aided  diagnosis,  X-ray  imaging 


16. SECURITY CLASSIFICATION OF: a. REPORT: U; b. ABSTRACT: U; c. THIS PAGE: U
17. LIMITATION OF ABSTRACT: UU
18. NUMBER OF PAGES: 97
19a. NAME OF RESPONSIBLE PERSON: USAMRMC
19b. TELEPHONE NUMBER (include area code):



1  Introduction

2  Body

3  Key Research Accomplishments

4  Reportable Outcomes

5  Conclusions

References

Appendices

A  Bibliography

B  List of Personnel

C  The Hypervolume under the ROC Hypersurface of “Near-Guessing” and “Near-Perfect” Observers in N-Class Classification Tasks

D  Evaluating Bayesian ANN estimates of ideal observer decision variables by comparison with identity functions

E  Review of several proposed three-class classification decision rules and their relation to the ideal observer decision rule

F  Analysis of proposed three-class classification decision rules in terms of the ideal observer decision rule

G  Restrictions on the Three-Class Ideal Observer’s Decision Boundary Lines

H  Optimization of an ROC hypersurface constructed only from an observer’s within-class sensitivities

I  Optimization of restricted ROC surfaces in three-class classification tasks

J  A utility-based performance metric for ROC analysis of N-class classification tasks


1  Introduction 


Our goal is to develop a fully automated classification scheme for computer-aided diagnosis
(CAD) in mammography. Traditional CAD classification schemes, and performance
measurement tools such as receiver operating characteristic (ROC) analysis, are based on
the premise that the observations are classified into two groups, most commonly malignant
and benign. Such classification schemes analyze radiologist-identified lesions and are
difficult to fully automate, because many false-positive (FP) detections produced
by a computerized detection scheme cannot reasonably be classified as benign or malignant
lesions. Our proposed scheme would classify computer detections into three groups: malignant
lesions, benign lesions, and FP computer detections. This method presents considerable
difficulties in terms of both signal detection theory and performance evaluation methods such
as ROC analysis. Our efforts in this direction during the course of the supported research
were thus generally more theoretical than practical. However, we consider the results of our
work both promising and important.

2  Body 

A  wide  variety  of  medical  decision-making  tasks,  in  particular  tasks  for  which  CAD  has  been 
proposed  as  an  aid  to  the  physician,  can  be  formulated  as  “two-group  classification”  tasks. 
That is, the physician must use the information available about a patient (e.g., a set of
mammographic films of the patient, and the result of computer analysis of those images) to
decide whether a patient belongs to a diseased, or abnormal, group or not (e.g., whether a
breast lesion suspicious enough to warrant further imaging procedures or biopsy is present
or not).

ROC analysis has long been considered the most appropriate methodology for evaluating
the performance of a two-group classifier or observer [1], particularly for medical
decision-making tasks [2]. Furthermore, the optimal or “ideal” observer (that observer which
achieves the best possible performance given a particular population of observational data)
has also been well understood for quite some time [3]. In practice, the ideal observer
requires knowledge of the probability density functions (PDFs) from which the observational
data are drawn, and thus cannot be achieved in non-trivial tasks by human or automated
observers. Nevertheless, successful methods for estimating ideal observer decision variables
from a sample of observational data [4], and for plotting an ideal observer ROC curve from
a sample of decision variable data [5], have been developed.
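
As an illustration only (this is not code from the project), the two-group case can be sketched for a simulated task in which the PDFs are known. The sketch assumes two univariate Gaussian populations with equal variance; the means, sample sizes, and function names are hypothetical choices made for the example. The ideal observer's decision variable is the likelihood ratio, and the AUC is estimated with the Mann-Whitney statistic.

```python
import math
import random


def gaussian_pdf(x, mean, sd):
    """Density of a univariate normal distribution evaluated at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def likelihood_ratio(x, mean_abnormal=1.5, mean_normal=0.0, sd=1.0):
    """Ideal observer decision variable p(x | abnormal) / p(x | normal).

    The Gaussian form and the default parameters are assumptions of this sketch.
    """
    return gaussian_pdf(x, mean_abnormal, sd) / gaussian_pdf(x, mean_normal, sd)


def empirical_auc(scores_abnormal, scores_normal):
    """Mann-Whitney estimate of AUC; tied pairs count as one half."""
    wins = 0.0
    for a in scores_abnormal:
        for n in scores_normal:
            if a > n:
                wins += 1.0
            elif a == n:
                wins += 0.5
    return wins / (len(scores_abnormal) * len(scores_normal))


# Simulated observational data drawn from the assumed PDFs.
rng = random.Random(0)
normal_x = [rng.gauss(0.0, 1.0) for _ in range(500)]
abnormal_x = [rng.gauss(1.5, 1.0) for _ in range(500)]

auc = empirical_auc(
    [likelihood_ratio(x) for x in abnormal_x],
    [likelihood_ratio(x) for x in normal_x],
)
print(f"empirical AUC of the likelihood-ratio observer: {auc:.3f}")
```

With real mammographic feature data the PDFs are unknown, which is why the decision variable has to be estimated from a sample rather than computed as above.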

Although the form of the three-group ideal observer has also been known for some time [3],
the development of a practical three-group classifier and a fully general extension of ROC
analysis to three-group classification has proven quite difficult, primarily due to the tremendous
increase in complexity encountered when one moves from two-group to three-group
classification tasks. Briefly, characterizing the performance of a three-group classifier requires an
ROC “hypersurface” with five degrees of freedom in a six-dimensional ROC space [6,7] (by
contrast, a two-group classifier is fully described by a simple curve in a two-dimensional ROC
space). Despite these difficulties, our research efforts are focused on the development of a
three-group classifier and performance evaluation methodology for breast lesion classification
in a mammographic CAD system.
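
The dimension counts quoted above can be reproduced with a simple counting sketch. It assumes that the axes of ROC space are the off-diagonal conditional probabilities P(decide j | truth i), of which there are N^2 - N for N groups, and that the hypersurface has one fewer degree of freedom than the space has axes; this agrees with the two-group and three-group figures in the text, but the general formula and the function names are assumptions of the example. The counts in the usage lines are made up.

```python
def roc_space_dimensions(n_groups):
    """Return (number of ROC axes, degrees of freedom of the ROC hypersurface).

    Assumes one axis per off-diagonal conditional probability and a
    hypersurface of dimension one less than the space that contains it.
    """
    axes = n_groups * n_groups - n_groups
    return axes, axes - 1


def misclassification_probabilities(counts):
    """Estimate P(decide j | truth i) for i != j from a square table of counts.

    counts[i][j] is the number of group-i cases that were assigned to group j.
    """
    probs = {}
    for i, row in enumerate(counts):
        total = sum(row)
        for j, c in enumerate(row):
            if i != j:
                probs[(i, j)] = c / total
    return probs


# Hypothetical counts for malignant (0), benign (1), and FP detections (2).
example_counts = [
    [80, 15, 5],
    [10, 70, 20],
    [5, 5, 90],
]
print(roc_space_dimensions(2), roc_space_dimensions(3))
print(misclassification_probabilities(example_counts))
```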

We strongly believe the development of such a three-group classifier to be of practical and
not merely academic importance. In the past, two types of mammographic CAD schemes
have  been  investigated  at  the  University  of  Chicago:  one  for  automatically  detecting  mass 
lesions  in  mammograms  [8-12],  and  another  for  classifying  known  lesions  as  malignant  or 
benign  [13-17].  Combining  these  two  types  of  CAD  schemes  is  inherently  difficult,  because 
the  output  of  the  detection  scheme,  which  identifies  candidates  for  subsequent  classification, 
will  necessarily  include  FP  computer  detections  in  addition  to  the  malignant  and  benign 
lesions  to  be  classified.  These  FP  computer  detections  correspond  to  objects  which  were 
by  design  not  included  in  the  training  sample  of  the  classification  scheme,  because  they  are 
not  members  of  the  data  population  (benign  and  malignant  breast  lesions)  for  which  the 
classification  scheme  was  created.  It  is  clear  then  that  the  detection  scheme’s  output  cannot 
be  used  unmodified  as  the  input  to  the  classification  scheme. 

Our  approach  has  been  to  treat  this  problem  explicitly  as  a  three-group  classification 
task. That is, each output of the detection scheme should be classified as a malignant lesion,
a benign lesion, or a non-lesion (FP computer detection), and the classifier to be estimated
is  the  ideal  observer  decision  function  for  this  task.  If  successful,  this  approach  would  allow 
radiologists  to  identify  more  malignant  lesions  without  increasing  biopsy  rates  for  patients 
without  malignancy. 

Our  approved  Statement  of  Work  was  as  follows: 

Task  1.  Develop  a  three-group  classifier  for  clustered  microcalcifications  in  mammograms,  Months 
1-12. 

(a) Collect cases containing 180 malignant and 180 benign clusters of microcalcifications.

(b) Determine truth state of imaged lesions by reviewing the images, radiologist reports,
and pathology reports for these cases.

(c)  Obtain  at  least  180  FP  computer  detections  from  these  cases  using  the  existing 
detection  scheme. 

(d)  Train  and  test  a  three-group  classifier  on  these  lesions,  using  methodology  we 
previously  developed  for  mass  lesions. 

Task  2.  Design  and  develop  an  interface  for  an  intelligent  workstation  for  CAD,  Months  11-14. 

(a)  Examine  the  most  useful  features  of  the  interface  of  the  existing  intelligent  CAD 
workstation  for  mammographic  lesion  detection. 

(b)  Examine  the  most  useful  features  of  the  interface  of  the  existing  CAD  schemes  in 
our  laboratory  for  classifying  manually  detected  lesions  as  malignant  or  benign. 

(c)  Develop  a  simple  interface  drawing  on  the  advantages  of  the  existing  detection 
and  classification  schemes,  extended  to  the  three-group  classification  task. 

(d)  Test  the  interface  with  non-radiologist  observers  in  our  laboratory  familiar  with 
the  goals  of  CAD  and  with  interface  design  principles. 

Task  3.  Design  and  perform  a  pilot  observer  study  measuring  radiologists’  performances  using 
the  three-group  classification  schemes  and  traditional  two-group  classification  schemes, 
Months  15-24. 

(a) Recruit radiologists from our institution and neighboring institutions.

(b) Provide training to the radiologists in the use of the intelligent CAD workstation
interfaces.

(c)  Measure  radiologist  performance  using  the  three-group  intelligent  workstation, 
and  using  the  existing  intelligent  workstation  for  detecting  lesions  followed  by 
manual  selection  of  lesions  to  be  analyzed  by  the  existing  schemes  for  two-group 
classification  of  lesions. 

Task 4. Develop techniques to compare radiologists’ performance in using the proposed
three-group and traditional two-group classification schemes, Months 18-36.

(a) Develop methodology to extend two-group ROC analysis to tasks in which observations
are classified into three groups.

(b) Develop methodology to determine the statistical significance of measured differences
in performance between three-group classifiers.

(c)  Use  this  methodology  to  analyze  the  observer  data  obtained  in  Task  3. 

For  Tasks  1(a)  and  1(b),  we  collected  during  the  second  year  of  this  project  a  database  of  134 
mammographic  cases,  four  standard  views  per  case;  the  majority  of  these  cases  contained 
malignant  or  benign  clustered  microcalcification  lesions.  During  the  course  of  the  past  year, 
however,  the  images  were  found  to  be  unsuitable  for  our  purposes.  We  therefore  collected 
another  set  of  270  images,  142  of  which  contained  benign  microcalcification  clusters,  and 
128  of  which  contained  malignant  microcalcification  clusters.  The  truth  for  the  malignant 
microcalcification  lesions  was  verified  by  pathology  report,  and  that  for  the  benign  lesions  by 
pathology report when biopsy was recommended, or by follow-up when that was recommended
by the original radiologist. These are fewer malignant and benign lesions than were
initially proposed for this project, but we will have the opportunity to supplement them with
further  such  cases  from  the  database  of  a  colleague  in  our  laboratories  should  the  research 
continue  under  other  funding  mechanisms  (see  Sec.  4). 

For  Tasks  1(c)  and  1(d),  we  initially  encountered  difficulties  porting  the  computer  code 
for  the  existing  detection  scheme  from  the  legacy  equipment  for  which  it  was  written  (IBM 
RISC  6000  machines,  whose  operating  systems  are  no  longer  supported  and  whose  hardware 
is  too  old  to  be  considered  reliable)  to  a  modern  PC  workstation  running  a  Linux  operating 
system.  These  difficulties  were  traced  to  compiler  incompatibilities  between  the  two  systems. 
A  computer  programmer  in  our  laboratory  with  extensive  experience  with  both  systems  and 
intimate  familiarity  with  the  internals  of  the  detection  scheme  investigated  and  eliminated 
the  majority  of  these. 

We  had  planned  to  submit  a  paper  to  Medical  Physics  reporting  on  the  results  for  Task  1. 
In  fact,  we  are  quite  close  to  obtaining  the  final  results  needed  for  completing  such  a  paper. 
Unfortunately, the principal investigator very recently discovered an error in the code he had
written [18] to interface between the numerical programming environment we use (MATLAB)
and the Bayesian artificial neural network (BANN) package of MacKay [19] that serves as
the basis of our classifier [4,18]. We fully expect the relevant experiments incorporating the
corrected  code  to  be  completed  soon,  and  should  be  able  to  submit  a  paper  describing  these 
results  to  Medical  Physics  within  another  two  or  three  weeks.  We  will  then  submit  to  the 
USAMRMC  an  addendum  to  this  report  including  those  final  results. 

Our research accomplishments focused largely on Task 4. Although the “methodology we
previously developed for mass lesions” [20] was successful for estimating ideal observer decision
variables based on lesion feature data, a practical classifier to make use of this decision
variable data has not yet been implemented. As the difficulties in theoretically characterizing
the behavior of such a three-group classifier are intimately related to evaluation of such a
classifier’s performance (i.e., the development of a three-group extension to ROC analysis),
such a reordering of the approved tasks seemed logically justified. In fact, the theoretical
difficulties involved in completely characterizing the general behavior of a three-group ideal
observer, and in developing a three-group extension to ROC analysis, prevented us from
accomplishing Tasks 2 or 3. However, proposed further work on those theoretical issues,
and on the development of such a classification scheme for CAD and its evaluation through
radiologist observer studies, served as the basis for two research grant applications that
we have submitted. These are listed in Sec. 4; if either is funded, it will provide support for
the principal investigator at the assistant professor level.

By  far  the  most  important  result  achieved  so  far  was  our  discovery  and  proof  (published 
during  the  first  year  of  support)  that  an  obvious  generalization  of  the  well-known  performance 
metric,  the  area  under  the  ROC  curve  (AUC),  is  not  in  fact  useful  in  tasks  with  three  or 
more groups [21]. (See Appendix C.) This accomplishment relates directly to Task 4(b)
above, which implicitly requires a well-defined performance metric with respect to which the
statistical significance of differences in performance may be computed. Although arguably
a “negative” rather than “positive” result (a well-defined performance metric has not yet
been found), this result has been very well received in the observer performance and CAD
research communities. First, it serves as a striking yet typical example of how intuition
can  often  be  an  unreliable  guide  in  extending  methodology  from  the  two-group  classification 
task  to  tasks  with  three  or  more  groups.  Second,  it  clearly  indicates  that  the  search  for 
such  a  well-defined  performance  metric  will  yield  a  deeper  understanding  of  the  properties 
of  three-group  observer  performance,  particularly  as  characterized  by  ROC  analysis. 

We stated above that exact determination of the ideal observer’s decision variables requires
knowledge of the PDFs from which the observational data to be classified were drawn.
The tool we have been using for some time now to estimate ideal observer decision variables
from samples of observational data is the BANN [19]. In previous simulation studies in
which the PDFs of the observational data are known, the output of the BANN was found
to agree with the calculated ideal observer decision variables for two-group [4] and
three-group [18] classification tasks. In practice, one does not have the PDFs of real observational
data, but we previously developed a means of evaluating three-group BANN decision variables
by comparing them with two-group BANN decision variables obtained from simplified
two-group tasks using the same observational data [20]. During the first year of support,
we developed an independent technique for evaluating three-group BANN estimates of ideal
observer decision variables, again based on theoretical properties of the three-group ideal
observer [22]. (See Appendix D.) This result is important because the three-group classifier
we  are  developing  under  the  current  research  will  be  trained  and  tested  using  feature  data 
from  actual  mammograms;  thus,  we  will  not  have  access  to  the  PDFs  from  which  those  data 
are  drawn.  In  addition  to  three-group  ROC  analysis  methods  to  be  developed  by  extension 
from  existing  two-group  methods  [5],  it  will  be  beneficial  to  have  a  direct  method  of  judging 
the  ability  of  the  BANN  decision  variables  to  accurately  estimate  ideal  observer  decision 
variables. 

During  the  first  and  second  years  of  support,  we  investigated  in  great  detail  the  behavior 
of  the  three-group  ideal  observer.  In  particular,  it  is  well-known  that  the  three-group  ideal 
observer makes decisions by partitioning a plane of two decision variables into three regions
using three decision boundary lines [3]. We showed that the locations and orientations
of these decision boundary lines are not arbitrary; given the slopes and y-intercepts, for
example,  of  two  of  the  lines,  those  of  the  third  line  are  constrained  to  lie  within  a  particular 
range  of  values  [23].  (See  Appendix  G.)  A  detailed  understanding  of  such  properties  of  the 
three-group  ideal  observer  will  prove  crucial  to  the  calculation  of  observer  ROC  operating 
points,  and  by  extension  to  observer  performance  evaluation  in  general. 
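
The dependence among the three lines can be illustrated with a short derivation. This is a sketch in an assumed notation that does not appear in the text above: the three groups are written π_1, π_2, π_3, with prior probabilities P(π_j) and PDFs p(x|π_j); U_{i|j} is the utility of deciding group i when the truth is group j; and the two decision variables are taken to be likelihood ratios relative to group 3. It shows only that the three boundary lines are linearly dependent; the specific ranges established in [23] are not derived here.

```latex
% Assumed decision variables (illustrative notation, not taken from the report):
\[
  \mathrm{LR}_k(\mathbf{x}) = \frac{p(\mathbf{x}\mid\pi_k)}{p(\mathbf{x}\mid\pi_3)},
  \qquad k = 1, 2 .
\]
% An observer maximizing expected utility prefers decision i to decision k when
\[
  \ell_{ik} \equiv
  (U_{i|1}-U_{k|1})\,P(\pi_1)\,\mathrm{LR}_1
  + (U_{i|2}-U_{k|2})\,P(\pi_2)\,\mathrm{LR}_2
  + (U_{i|3}-U_{k|3})\,P(\pi_3) \;\ge\; 0 .
\]
% Each boundary \ell_{ik} = 0 is a line in the (LR_1, LR_2) plane.
% Adding the coefficients term by term gives, identically,
\[
  \ell_{12} + \ell_{23} = \ell_{13} .
\]
```

Under these assumptions, any point that lies on two of the boundary lines also lies on the third, so once two lines are fixed the third cannot be placed independently: it must pass through their point of intersection whenever one exists.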

In  our  efforts  to  develop  a  three-group  classifier  and  appropriate  performance  evaluation 
methodology, we have made every attempt to keep our analysis as general as possible despite
the theoretical difficulties this entails. Other researchers have proposed three-group
methodology by considering observers whose behavior is restricted in particular ways, or by
considering only a subset of the possible performance characterization indices (the axes of
ROC space), or both [24-28]. The inherent complexity of the three-group classification task
makes direct comparison of different methods by different researchers difficult. To facilitate
such a comparison, we analyzed the different methods in terms of the three-group ideal observer,
both in preliminary work [29] (see Appendix E) and later through more in-depth
analysis [30]. (See Appendix F.) In addition to providing us with valuable insight and experience
in comparing different classifiers, which should ultimately prove directly relevant to
the  completion  of  Task  4,  this  work  also  enabled  us  to  present  to  the  observer  performance 
and  CAD  research  communities  a  useful  framework  within  which  comparison  of  superficially 
very  different  classifiers  can  readily  be  made.  A  poster  presentation  of  the  theoretical  results 
of  this  and  the  preceding  paragraph,  as  well  as  our  research  accomplishments  during  the  first 
year  of  this  award,  was  made  at  the  2005  US  DOD  Breast  Cancer  Research  Program  Era  of 
Hope  Meeting  in  Philadelphia,  PA  [31]. 

In the second and third years of support, we analyzed a simplified performance evaluation
method (i.e., an extension of ROC analysis to tasks with three groups) which considers
only the three “sensitivities” of the observer, that is, the three probabilities of correctly
identifying an observation from one of the three respective groups. (This can, in general, be
expected to yield an incomplete description of observer performance, which requires a set
of six conditional classification probabilities [7].) This method was originally proposed by
Mossman [26] for a pair of essentially ad hoc decision rules and arbitrary decision variables,
and more recently advocated by He et al. [28] for a set of ideal observer decision variables
and a decision rule shown [28-30] to be a special case of the ideal observer decision rule,
and also shown [29,30] to be a special case of the decision rule proposed by Scurfield [25].
We  were  able  to  derive  a  more  fundamental  motivation  for  the  decision  rules  described  in 
those  works,  given  the  simplified  performance  description  in  terms  of  only  the  sensitivities, 
by  applying  previously  successful  Neyman-Pearson  optimization  methodology  [3,  7]  to  this 
restricted  performance  evaluation  strategy. 

Simply  put,  assuming  that  one  chooses  to  measure  observer  performance  only  in  terms 
of  the  observer’s  sensitivities,  we  proved  [32]  that  the  optimal  observer  with  respect  to  this 
metric is in fact the special case of the ideal observer proposed by He et al. [28]. (See
Appendix  H.)  We  then  applied  this  analysis  technique  [33]  to  other  decision  strategies  and 
performance  evaluation  strategies  which  we  had  previously  analyzed  in  terms  of  the  ideal 
observer  decision  rule  [30].  (See  Appendix  I.)  Given  the  difficulties  inherent  in  a  fully  general 
description  of  three-class  ideal  observer  behavior  and  performance  evaluation,  it  is  possible 
that  a  restricted  or  simplified  model,  similar  to  those  proposed  already  by  other  researchers, 
may  ultimately  prove  of  greater  practical  value  than  the  fully  general  theoretical  model. 
We consider this work important, because it provides a principled theoretical framework in
which to evaluate and compare such restricted and simplified models.
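
The sense in which a sensitivities-only description is incomplete can be shown with a small numerical sketch. The two tables below are hypothetical observers invented for the example; each entry p[i][j] is the conditional probability of deciding group j given that the truth is group i. Both observers have the same three sensitivities but different patterns of misclassification.

```python
def sensitivities(p):
    """Return the diagonal entries P(decide i | truth i) of a square table."""
    return [p[i][i] for i in range(len(p))]


def off_diagonal(p):
    """Return the misclassification probabilities P(decide j | truth i), i != j."""
    n = len(p)
    return [p[i][j] for i in range(n) for j in range(n) if i != j]


# Hypothetical observers; rows are truth states, columns are decisions.
observer_a = [
    [0.8, 0.2, 0.0],
    [0.1, 0.7, 0.2],
    [0.0, 0.1, 0.9],
]
observer_b = [
    [0.8, 0.0, 0.2],
    [0.2, 0.7, 0.1],
    [0.1, 0.0, 0.9],
]

print(sensitivities(observer_a) == sensitivities(observer_b))
print(off_diagonal(observer_a) == off_diagonal(observer_b))
```

Whether the difference between two such observers matters depends on how the different kinds of error are valued, which is one reason restricted models need a principled justification.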

As  stated  above,  a  well-defined  performance  metric  is  required  in  order  to  understand 
the  properties  of  three-group  observer  performance,  particularly  as  characterized  by  ROC 
analysis. Furthermore, we showed that an obvious generalization of the AUC does not in
fact prove useful in tasks with three or more groups [21]. During the third year of support,
we developed, and presented preliminary results of studies involving, a novel “utility”-based
performance metric [34]. (See Appendix J.) In the beginning of this section, we introduced
the concept of the ideal observer as that observer which achieves the best possible performance
given a particular population of observational data. One way of deriving the ideal
observer  model  is  to  assign  a  number,  the  utility,  to  each  possible  decision;  the  ideal  observer 
is  then  that  observer  which  maximizes  the  expected  utility  [3,7].  Our  proposed  performance 
metric  is  grounded  in  this  concept  of  the  utility  of  an  observer’s  decisions,  and  can  be  shown 
to  be  directly  related  to  intuitive  properties  of  the  observer’s  ROC  curve  (AUC  and  the  arc 
length  along  the  curve  or,  for  tasks  with  more  than  two  groups,  the  hypervolume  under  the 
ROC  hypersurface  and  the  hypersurface  itself).  Although  further  analysis  will  be  necessary 
to  fully  characterize  the  properties  of  this  novel  performance  metric,  we  have  high  hopes 
that  it  will  prove  to  be  of  use  in  characterizing  observer  performance  without  being  subject 
to  the  limitations  we  have  shown  exist  for  a  more  obvious  generalization  of  the  AUC. 
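
The notion of expected utility referred to above can be sketched generically as follows. This is not the performance metric of [34], whose definition is given in Appendix J; it only illustrates how a utility assigned to each decision leads to a decision rule and to a single summary number. The utilities, prior probabilities, and conditional probabilities are hypothetical values, and the indexing convention (rows are truth states, columns are decisions) is an assumption of the example.

```python
def expected_utility(cond_probs, priors, utilities):
    """Expected utility of an observer.

    cond_probs[i][j] = P(decide j | truth i); priors[i] = P(truth i);
    utilities[i][j] = utility of deciding j when the truth is i.
    """
    total = 0.0
    for i, prior in enumerate(priors):
        for j, prob in enumerate(cond_probs[i]):
            total += prior * prob * utilities[i][j]
    return total


def utility_maximizing_decision(posteriors, utilities):
    """Return the decision j that maximizes sum_i P(truth i | data) * U[i][j]."""
    n_decisions = len(utilities[0])
    scores = [
        sum(posteriors[i] * utilities[i][j] for i in range(len(posteriors)))
        for j in range(n_decisions)
    ]
    return max(range(n_decisions), key=lambda j: scores[j])


# Hypothetical values: unit utility for a correct decision, zero otherwise.
identity_utilities = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
priors = [0.2, 0.3, 0.5]
cond_probs = [
    [0.8, 0.2, 0.0],
    [0.1, 0.7, 0.2],
    [0.0, 0.1, 0.9],
]

print(expected_utility(cond_probs, priors, identity_utilities))
print(utility_maximizing_decision([0.2, 0.5, 0.3], identity_utilities))
```

With the identity utilities used here the expected utility reduces to the overall probability of a correct decision; other utility assignments weight the individual errors differently.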

A  detailed  understanding  of  the  properties  of  the  general  three-group  ideal  observer,  and 
of  the  restricted  and  simplified  ROC  models  described  above,  will  ultimately  prove  crucial  to 
the  calculation  of  observer  ROC  operating  points,  and  by  extension  to  observer  performance 
evaluation  in  general.  Throughout  the  course  of  this  project,  the  principal  investigator  and 
mentor  have  held  regular  meetings  to  discuss  the  theoretical  challenges  posed  by  this  project 
and  to  explore  possible  ways  of  overcoming  those  challenges. 


3  Key  Research  Accomplishments 

•  Proof  that  an  obvious  generalization  of  the  well-known  two-group  performance  metric, 
the  AUC,  is  not  useful  in  classification  tasks  with  three  or  more  groups  (Appendix  C) 

•  Development  of  a  novel  technique  for  evaluating  the  quality  of  BANN  estimates  of  ideal 
observer  decision  variables  in  the  absence  of  three-group  ROC  analysis  methodology 
and  observational  data  PDFs  (Appendix  D) 

•  Detailed  investigation  of  the  relationships  among  the  decision  boundary  lines  used  by 
the  three-group  ideal  observer  (Appendix  G) 

•  Analysis  of  several  proposed  three-group  classification  methods  in  the  literature  in 
terms  of  the  three-group  ideal  observer  (Appendices  E,  F) 

• Development of principled theoretical motivation for proposed three-group classification
methods given selection of restricted or simplified three-group evaluation methodology
(Appendices H, I)

•  Development  and  preliminary  analysis  of  a  novel  utility-based  performance  metric, 
which  we  hope  will  generalize  better  to  classification  tasks  with  more  than  two  groups 
than does the conventional AUC (Appendix J)


4  Reportable Outcomes

• D. C. Edwards, C. E. Metz, and R. M. Nishikawa, “The hypervolume under the ROC
hypersurface of ‘near-guessing’ and ‘near-perfect’ observers in N-class classification
tasks,” IEEE Trans. Med. Imag., vol. 24, pp. 293-299, 2005.

• D. C. Edwards and C. E. Metz, “Evaluating Bayesian ANN estimates of ideal observer
decision variables by comparison with identity functions,” in Proc. SPIE Vol. 5749
Medical Imaging 2005: Image Perception, Observer Performance, and Technology Assessment,
Miguel P. Eckstein and Yulei Jiang, Eds., SPIE, Bellingham, WA, 2005, pp.
174-182. [Conference presentation and proceedings paper.]

•  D.  C.  Edwards  and  C.  E.  Metz,  “Review  of  several  proposed  three-class  classification 
decision  rules  and  their  relation  to  the  ideal  observer  decision  rule,”  in  Proc.  SPIE  Vol. 
5749  Medical  Imaging  2005:  Image  Perception,  Observer  Performance,  and  Technology 
Assessment,  Miguel  P.  Eckstein  and  Yulei  Jiang,  Eds.,  SPIE,  Bellingham,  WA,  2005, 
pp.  128-137.  [Conference  presentation  and  proceedings  paper.] 

•  Collection  of  database  of  270  mammographic  cases  containing  malignant  and  benign 
clustered  microcalcification  lesions,  with  truth  determined  by  pathology  (for  biopsied 
lesions)  or  mammographic  followup  (benign  lesions  only) 

•  Porting  of  existing  computerized  scheme  for  detecting  clustered  microcalcifications  in 
mammograms  from  legacy  computer  systems  no  longer  in  operation  to  workstations 
currently  in  use  for  this  project 

•  D.  C.  Edwards,  C.  E.  Metz,  R.  M.  Nishikawa,  and  M.  L.  Giger,  “Investigation  of 
three-group  classifiers  to  fully  automate  detection  and  classification  of  breast  lesions 
in  computer-aided  diagnosis  for  mammography,”  US  DOD  Breast  Cancer  Research 
Program  Era  of  Hope  Meeting,  Philadelphia,  PA,  2005. 

•  D.  C.  Edwards  and  C.  E.  Metz,  “Restrictions  on  the  three-class  ideal  observer’s  decision 
boundary  lines,”  IEEE  Trans.  Med.  Imag.,  vol.  24,  pp.  1566-1573,  2005. 

• D. C. Edwards and C. E. Metz, “Analysis of proposed three-class classification decision
rules in terms of the ideal observer decision rule,” J. Math. Psychol., vol. 50,
pp. 478-487, 2006.

• D. C. Edwards and C. E. Metz, “Optimization of an ROC hypersurface constructed
only from an observer’s within-class sensitivities,” in Proc. SPIE Vol. 6146 Medical
Imaging 2006: Image Perception, Observer Performance, and Technology Assessment,
Yulei Jiang and Miguel P. Eckstein, Eds., SPIE, Bellingham, WA, 2006,
pp. 61460A1-61460A7. [Conference presentation and proceedings paper.]

• D. C. Edwards and C. E. Metz, “ROC Analysis in Radiology: The State of the Art, and
Recent N-Class Investigations,” Third Workshop on Receiver Operating Characteristic
Analysis in Machine Learning, Pittsburgh, PA, 2006. (Invited talk.)

• D. C. Edwards and C. E. Metz, “A utility-based performance metric for ROC analysis
of N-class classification tasks,” in Proc. SPIE Vol. 6515 Medical Imaging 2007:
Image Perception, Observer Performance, and Technology Assessment, Yulei Jiang
and Berkman Sahiner, Eds., SPIE, Bellingham, WA, 2007, pp. 6515031-65150310.
[Conference presentation and proceedings paper.]

• D. C. Edwards and C. E. Metz, “Optimization of restricted ROC surfaces in three-class
classification tasks,” IEEE Trans. Med. Imag., 2006 (accepted for publication 5 Mar.
2007).

•  D.  C.  Edwards,  J.  Papaioannou,  C.  E.  Metz,  A.  V.  Edwards,  and  R.  M.  Nishikawa, 
“Estimating  three-class  ideal  observer  decision  variables  for  computerized  detection 
and  classification  of  mammographic  microcalcification  lesions,”  2007  (in  preparation). 

• D. C. Edwards, PI: “N-Class Image Classification for Computer-Aided Breast Cancer
Diagnosis,” application for support under NIH K99/R00 funding mechanism; submitted
June 2006, unscored, resubmitted March 2007.

• D. C. Edwards, Project co-Leader under C. E. Metz (Project Leader) and R. M.
Nishikawa (Program PI): “Three-class Receiver Operating Characteristic Analysis for
Evaluation of Computer-Aided Diagnosis,” Project 3 of Program Project Grant
“Translating Computer-Aided Diagnosis (CADx) from the Lab to the Clinic,” application for
support under NIH P01 funding mechanism; submitted Oct. 2006, merit rating 1.7
(overall program priority score: 209), resubmitted May 2007.


5  Conclusions 

During the first year of support, we proved that an obvious generalization of the well-known
two-group performance metric, the AUC, is not in fact a useful performance metric for
classification tasks with three or more groups. We developed an evaluation technique, independent
of those we had previously developed, for assessing the ability of BANN decision variables
to accurately estimate ideal observer decision variables. We analyzed several recently
proposed three-group classification methods in terms of the three-group ideal observer. We also
showed that the three decision boundary lines used by the three-group ideal observer are not
arbitrary, but are intricately related to one another.

During the second year of support, with the assistance of colleagues in our laboratory, we
collected a database of 134 mammographic cases containing malignant and benign clustered
microcalcification lesions, with truth determined by pathology (for biopsied lesions) or
mammographic followup (benign lesions only), and we ported the existing computerized scheme
for detecting clustered microcalcifications in mammograms from legacy computer systems
no longer in operation to workstations currently in use for this project. We reported on the
important theoretical results we had developed to date at the 2005 Breast Cancer Research
Program Era of Hope Meeting. We also developed principled theoretical motivations for
various proposed three-group classification methods, given in each case the selection of a
restricted or simplified three-group evaluation methodology.

Although the first set of images we collected proved unsuitable for our purposes, we were
able during the past year to collect 270 mammographic images and are close to completing
experiments, using these images, designed to evaluate the ability of BANNs to estimate ideal
observer decision variables for mammographic lesion feature data (as opposed to simulated
data). The principal investigator was invited to give a talk at the Third Workshop on
Receiver Operating Characteristic Analysis in Machine Learning (a conference within the
International Conference on Machine Learning symposium) on the subject of the state of
the art of ROC analysis in radiology and on our recent investigations in classification with
more than two groups. We have also continued to advance our theoretical understanding
of the three-group ideal observer and methods of evaluating its performance. In particular,
we have developed a novel utility-based performance metric which we have reason to believe
may be useful for classification tasks with more than two groups without suffering from the
limitations of more obvious generalizations of the well-known AUC performance metric.

Although our primary research accomplishments have been theoretical, they are crucial
steps in the development of a practical three-group classifier and a fully general three-group
performance evaluation methodology. Despite the considerable difficulties involved in such
development, a CAD scheme incorporating a three-group classifier as we propose could
potentially allow radiologists to detect more malignant breast lesions without increasing their
FP biopsy rate. We believe this goal to be worth the necessary effort on our part.

The Hypervolume Under the ROC Hypersurface of
“Near-Guessing” and “Near-Perfect” Observers in
N-Class Classification Tasks

Darrin  C.  Edwards*,  Charles  E.  Metz,  and  Robert  M.  Nishikawa 


Abstract: We express the performance of the $N$-class “guessing” observer in terms of the
$N^2 - N$ conditional probabilities which make up an $N$-class receiver operating
characteristic (ROC) space, in a formulation in which sensitivities are eliminated in
constructing the ROC space (equivalent to using false-negative fraction and false-positive
fraction in a two-class task). We then show that the “guessing” observer’s performance in
terms of these conditional probabilities is completely described by a degenerate hypersurface
with only $N - 1$ degrees of freedom (as opposed to the $N^2 - N - 1$ required, in general, to
achieve a true hypersurface in such a ROC space). It readily follows that the hypervolume
under such a degenerate hypersurface must be zero when $N > 2$. We then consider a
“near-guessing” task; that is, a task in which the $N$ underlying data probability density
functions (pdfs) are nearly identical, controlled by $N - 1$ parameters which may vary
continuously to zero (at which point the pdfs become identical). With this approach, we show
that the hypervolume under the ROC hypersurface of an observer in an $N$-class classification
task tends continuously to zero as the underlying data pdfs converge continuously to identity
(a “guessing” task). The hypervolume under the ROC hypersurface of a “perfect” ideal observer
(in a task in which the $N$ data pdfs never overlap) is also found to be zero in the ROC space
formulation under consideration. This suggests that hypervolume may not be a useful
performance metric in $N$-class classification tasks for $N > 2$, despite the utility of the
area under the ROC curve for two-class tasks.

Index Terms: $N$-class classification, ROC analysis, ROC performance metrics.


I.  Introduction 

We are attempting to develop a fully automated mass lesion classification scheme for
computer-aided diagnosis (CAD) in mammography. This scheme will combine two schemes
developed at the University of Chicago: one for automatically detecting mass lesions in
mammograms [1]-[5], and one for classifying known lesions as malignant or benign [6]-[10].
Combining these two types of CAD scheme is inherently difficult, because the output of the
detection scheme will necessarily include false-positive (FP) computer detections in
addition to the malignant and benign lesions to be classified. These FP computer detections
correspond to objects which were by design not included in the training sample of the
classification scheme, because they are not members of the data population (benign and
malignant mass breast lesions) for which the classification scheme was created. It is clear
then that the detection scheme’s output cannot be used unmodified as the input to the
classification scheme.

Our approach has been to treat this problem explicitly as a three-class classification task.
That is, the outputs of the detection scheme should be classified as malignant lesions, benign
lesions, and nonlesions (FP computer detections), and the classifier to be estimated is the
ideal observer decision function for this task. Such an approach presents considerable
difficulties of its own. On the one hand, decision functions, in particular ideal observer
decision functions, increase rapidly in complexity with the number of classes involved. On
the other hand, fully general performance evaluation methods, in particular a fully general
three-class extension of receiver operating characteristic (ROC) analysis, have yet to be
developed for such a task.

Although we have had preliminary success in using Bayesian artificial neural networks
(BANNs) [11], [12] to estimate three-class ideal-observer-related decision variables [13],
[14], the task of developing an extension of ROC analysis to classification tasks with three
or more classes has proved somewhat more daunting. Our initial efforts in this direction
have, thus, been more theoretical than practical so far [15]. One issue we began to
investigate recently was the calculation of an obvious generalization of the well-known area
under the ROC curve (AUC) performance metric, a quantity we are calling the “hypervolume
under the ROC hypersurface.” Detailed consideration of the integrals involved in calculating
this quantity led us to the counterintuitive conclusion that, despite the great success and
utility of the AUC performance metric in two-class classification tasks, the hypervolume
under the ROC hypersurface does not appear to be a useful performance metric in $N$-class
classification tasks for $N > 2$. The proof of this claim is arrived at by considering
observer performance in two extremes: the “guessing” observer and the “perfect” observer. It
should be explicitly noted that in our formulation, sensitivities are eliminated in
constructing the ROC space; this is equivalent to using false-negative fraction (FNF) and
false-positive fraction (FPF) in a two-class task. In such a formulation, the “guessing”
observer in a two-class task achieves an AUC of 0.5 as expected, but the “perfect” observer
in a two-class task achieves an AUC of zero.
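
The two-class behavior described above can be checked numerically. The following is a minimal sketch, assuming two unit-variance Gaussian classes separated by `d_prime` (an illustrative model chosen here, not one taken from this work); the function name and the integration grid are likewise hypothetical.

```python
from statistics import NormalDist


def area_under_fnf_vs_fpf(d_prime, steps=20000):
    """Trapezoidal area under the FNF-versus-FPF curve.

    Assumes (for illustration only) two unit-variance Gaussian classes whose
    means differ by d_prime, with "positive" decided above a threshold.
    """
    std = NormalDist()
    lo, hi = -10.0, 10.0 + d_prime
    area = 0.0
    prev_fpf, prev_fnf = 1.0, 0.0  # threshold far below both classes
    for k in range(steps + 1):
        threshold = lo + (hi - lo) * k / steps
        fpf = 1.0 - std.cdf(threshold)      # negatives above the threshold
        fnf = std.cdf(threshold - d_prime)  # positives below the threshold
        area += 0.5 * (fnf + prev_fnf) * (prev_fpf - fpf)
        prev_fpf, prev_fnf = fpf, fnf
    return area


guessing_area = area_under_fnf_vs_fpf(0.0)
near_perfect_area = area_under_fnf_vs_fpf(6.0)
print(f"guessing: {guessing_area:.4f}, well separated: {near_perfect_area:.2e}")
```

With identical class distributions the area comes out close to 0.5, and it shrinks toward zero as the separation grows, which is the “upside-down” counterpart of the conventional AUC.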

In Section II, we consider the properties of the “guessing” observer in an $N$-class
classification task, and of its ROC hypersurface. In Section III, we consider the properties
of the ROC hypersurface of a so-called “near-guessing” observer, i.e., an observer in a task
for which the observational data probability density functions (pdfs) are not identical, but
differ only by arbitrarily small amounts. In Section IV, we then show that the hypervolume
under the ROC hypersurface of such a “near-guessing” observer will continuously approach the
hypervolume under the ROC hypersurface of the “guessing” observer as the observational data
pdfs continuously approach identity; furthermore, the hypervolume under the ROC hypersurface
of the “guessing” observer is shown to be zero.

We then show in Section V that the hypervolume under the ROC hypersurface of the “perfect”
observer is zero (as expected by analogy with the two-class task), and that the hypervolume
under the ROC hypersurface of a “near-perfect” observer will approach zero continuously as
the observational data pdfs are separated. Finally, in Section VI, we argue that these
results taken together imply that the hypervolume under the ROC hypersurface is not a useful
performance metric in $N$-class classification tasks for $N > 2$, despite the utility of the
AUC performance metric in two-class tasks.


II. The ROC Hypersurface of the N-Class “Guessing” Observer

The performance of an observer in an $N$-class classification task is completely determined
by a hypersurface with $N^2 - N - 1$ degrees of freedom in an $(N^2 - N)$-dimensional ROC
space [16]. Without loss of generality, we can specify any point in the ROC space by a vector
of the misclassification probabilities

\[
[P(\mathbf{d} = \pi_1 \mid \mathbf{t} = \pi_2), \ldots, P(\mathbf{d} = \pi_1 \mid \mathbf{t} = \pi_N),
P(\mathbf{d} = \pi_2 \mid \mathbf{t} = \pi_1), P(\mathbf{d} = \pi_2 \mid \mathbf{t} = \pi_3), \ldots,
P(\mathbf{d} = \pi_2 \mid \mathbf{t} = \pi_N), \ldots,
P(\mathbf{d} = \pi_N \mid \mathbf{t} = \pi_{N-1}), P(\mathbf{d} = \pi_N \mid \mathbf{t} = \pi_1)]^T
\]

[15]. Here the $N$ classes are denoted by the labels $\pi_1, \ldots, \pi_N$; $\mathbf{d}$
denotes the class to which an observation is assigned (the “decision”); and $\mathbf{t}$ is
the class to which it actually belongs (the “truth”). We use boldface type to denote
statistically variable quantities. For simplicity, we write
$P(\mathbf{d} = \pi_i \mid \mathbf{t} = \pi_j)$ as $P_{ij}$.

We can also, again without loss of generality, consider the ROC hypersurface to be given by
$P_{N1}$ considered as a function of the other $N^2 - N - 1$ misclassification probabilities
[15]. Note that this formulation is equivalent, in a two-class classification task, to using
FPF and FNF to characterize the ROC curve, rather than FPF and true-positive fraction (TPF),
as is more common. In a two-class classification task, this produces ROC curves which are
“upside-down” with respect to the standard formulation; we have adopted the nonstandard
formulation described above because it has proven easier to generalize to classification
tasks with more than two classes.

Some researchers have suggested [17], [18] that in, e.g., a three-class classification task,
the set of three “sensitivities” ($P(\mathbf{d} = \pi_i \mid \mathbf{t} = \pi_i)$ in our
notation) provides a complete description of observer performance. This is incorrect in
general, because it ignores the $N^2 - N$ misclassification probabilities, not all of which
are determined uniquely by the “sensitivities” when $N > 2$ unless particular restrictions
are imposed on the observer’s behavior. Complete quantification of the trade-offs available
among the probabilities of various kinds of misclassification error is important in medical
diagnosis, where different misclassification errors often have substantially different
clinical consequences. Moreover, restrictions concerning the observer’s behavior are
inappropriate when considering the general behavior of ideal observers, human observers, or
automated observers (such as automated schemes for computer-aided diagnosis) designed to
approximate ideal or human observer behavior. Other researchers have reduced the three-class
ROC hypersurface to more tractable two-dimensional surfaces in three-dimensional ROC spaces
by explicitly imposing restrictions on the form of the observer’s decision rule [19], [20],
or on the utilities used by an ideal observer [21]. While such restrictions may ultimately
prove to be of great pragmatic importance given the inherent complexity of multi-class
classification tasks, our approach so far has been to attempt as general an understanding as
possible of the unrestricted classification task.
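
To illustrate why the three sensitivities alone do not fix the misclassification probabilities, the sketch below compares two hypothetical three-class observers. The matrices are invented for illustration; the entry `matrix[i][j]` is assumed to hold the probability of deciding class i+1 when the truth is class j+1, so each column sums to one.

```python
def sensitivities(matrix):
    """Diagonal entries: probability of deciding class i when class i is true."""
    return [matrix[i][i] for i in range(len(matrix))]


def misclassification_vector(matrix):
    """Off-diagonal entries, ordered by decision class and then by true class."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(n) if i != j]


# Two made-up observers with identical sensitivities (0.8, 0.7, 0.6).
observer_a = [
    [0.80, 0.20, 0.10],
    [0.10, 0.70, 0.30],
    [0.10, 0.10, 0.60],
]
observer_b = [
    [0.80, 0.05, 0.30],
    [0.15, 0.70, 0.10],
    [0.05, 0.25, 0.60],
]

same_sensitivities = sensitivities(observer_a) == sensitivities(observer_b)
same_errors = misclassification_vector(observer_a) == misclassification_vector(observer_b)
print(same_sensitivities, same_errors)
```

The two observers agree on every sensitivity but distribute their errors differently, so they occupy different points in the six-dimensional ROC space.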

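Consider the performance of an observer which makes decisions by “guessing,” that is, in a
random fashion unrelated to the actual class $\mathbf{t}$ from which a given observation is
drawn. (Note that this corresponds to the performance of the ideal observer when the pdfs of
the observational data are identical, i.e.,
$p(\vec{x} \mid \pi_1) = p(\vec{x} \mid \pi_2) = \cdots = p(\vec{x} \mid \pi_N)$.) In this
case, we clearly must have

\[ P_{12} = P_{13} = \cdots = P_{1N} \tag{1} \]
\[ \vdots \]
\[ P_{i1} = \cdots = P_{ij} = \cdots = P_{iN} \quad \{i \neq j\} \tag{2} \]
\[ \vdots \]
\[ P_{N1} = P_{N2} = \cdots = P_{N(N-1)}. \tag{3} \]

Defining $\alpha_i \equiv P_{iN}$ for $1 \le i \le N - 1$, and $\alpha_N \equiv P_{N(N-1)}$,
we see that the performance of the “guessing” observer is given by a locus of vectors of the
form

\[
[\alpha_1, \ldots, \alpha_1, \ldots, \alpha_i, \ldots, \alpha_i, \ldots, \alpha_N, \ldots, \alpha_N]^T \tag{4}
\]

where all of the $\alpha_i$ are restricted to the range $[0, 1]$. Furthermore,

\[ \sum_{i=1}^{N} \alpha_i = 1 \tag{5} \]

which immediately gives $\alpha_N = 1 - \sum_{i=1}^{N-1} \alpha_i$. Thus, the performance of
the “guessing” observer is given by

\[
\Bigl[\alpha_1, \ldots, \alpha_1, \ldots, \alpha_i, \ldots, \alpha_i, \ldots,
1 - \sum_{j=1}^{N-1} \alpha_j, \ldots, 1 - \sum_{j=1}^{N-1} \alpha_j\Bigr]^T
= [0, \ldots, 0, \ldots, 0, \ldots, 0, \ldots, 1, \ldots, 1]^T
+ \sum_{i=1}^{N-1} \alpha_i\, [0, \ldots, 0, \ldots, 1, \ldots, 1, \ldots, -1, \ldots, -1]^T \tag{6}
\]

where each distinct entry is repeated $N - 1$ times, and where, in the $i$th vector of the
sum, the entries equal to 1 occupy the $i$th group of $N - 1$ elements and the entries equal
to $-1$ occupy the last group. This is the parametric equation for an $(N - 1)$-dimensional
plane in an $(N^2 - N)$-dimensional space; the actual performance of the “guessing” observer
will of course be further restricted to a region within this plane such that
$0 \le \alpha_i \le 1$, $0 \le 1 - \sum_i \alpha_i \le 1$.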
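
As a small illustration of the degeneracy described above, the sketch below builds the misclassification vector of a guessing observer from its $N - 1$ free guessing probabilities. The function name and the grouping of entries by decision class are assumptions made for this example.

```python
def guessing_roc_point(alphas):
    """Misclassification vector of a guessing observer.

    alphas holds the N - 1 free probabilities of guessing classes 1..N-1;
    the probability of guessing class N is whatever remains. Each probability
    is repeated N - 1 times (once per incorrect true class).
    """
    alpha_last = 1.0 - sum(alphas)
    if min(alphas) < 0.0 or alpha_last < 0.0:
        raise ValueError("guessing probabilities must be non-negative and sum to at most 1")
    all_alphas = list(alphas) + [alpha_last]
    n_classes = len(all_alphas)
    return [alpha for alpha in all_alphas for _ in range(n_classes - 1)]


point = guessing_roc_point([0.2, 0.3])  # three classes, six-dimensional ROC space
print(point)
```

For three classes the vector has six entries but only two free parameters, so the set of achievable points is a two-dimensional region rather than the five-dimensional hypersurface a general observer would trace out.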


III. The ROC Hypersurface of an N-Class “Near-Guessing” Observer

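Consider observational data $\vec{x}$ drawn from $N$ pdfs

\[ p(\vec{x} \mid \mathbf{t} = \pi_1) = p(\vec{x} \mid \mathbf{t} = \pi_N) + \delta_1 h_1(\vec{x}) \tag{7} \]
\[ \vdots \]
\[ p(\vec{x} \mid \mathbf{t} = \pi_j) = p(\vec{x} \mid \mathbf{t} = \pi_N) + \delta_j h_j(\vec{x}) \tag{8} \]
\[ \vdots \]
\[ p(\vec{x} \mid \mathbf{t} = \pi_N) \tag{9} \]

where $0 \le \delta_j \le 1$, $\int h_j(\vec{x})\, d^n x = 0$, and
$|h_j(\vec{x})| \le p(\vec{x} \mid \mathbf{t} = \pi_N)$ for $1 \le j \le N - 1$. In the limit
as the $\delta_j$ all approach zero, we expect the performance of any observer for this task
to converge smoothly to that of the “guessing” observer.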
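
A concrete one-dimensional instance of such a family of pdfs can be sketched as follows. The standard normal reference pdf and the `tanh`-shaped perturbation are illustrative assumptions chosen only because they satisfy the stated constraints; they are not taken from this work.

```python
import math


def base_pdf(x):
    """Reference pdf for class N: a standard normal (illustrative choice)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def h(x):
    """Hypothetical perturbation: odd in x, so it integrates to zero,
    and bounded in magnitude by base_pdf because |tanh| < 1."""
    return math.tanh(x) * base_pdf(x)


def near_pdf(x, delta):
    """Perturbed pdf base_pdf + delta * h for 0 <= delta <= 1."""
    return base_pdf(x) + delta * h(x)


def integrate(f, lo=-12.0, hi=12.0, steps=24000):
    """Midpoint-rule integral of f over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(f(lo + (k + 0.5) * width) for k in range(steps)) * width


total = integrate(lambda x: near_pdf(x, 0.3))
h_area = integrate(h)
print(f"integral of perturbed pdf: {total:.6f}, integral of h: {h_area:.2e}")
```

Because the perturbation integrates to zero and never exceeds the reference pdf in magnitude, the perturbed function remains a valid pdf for every `delta` between 0 and 1, and it collapses onto the reference pdf as `delta` goes to zero.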

Decisions are made by partitioning the decision variable space into $N$ regions, determined
by a total of $N^2 - N - 1$ parameters; we denote these parameters by the components of a
vector $\vec{\gamma}$. An observer which uses more than $N^2 - N - 1$ parameters for an
$N$-class classification task can always be replaced by a simplified observer, such that the
“excess” parameters are eliminated by the requirement that $P_{N1}$ be minimized, thereby
collapsing the dimensionality of the parameter space to $N^2 - N - 1$. On the other hand, an
observer which uses fewer than $N^2 - N - 1$ decision parameters will fail to generate a true
ROC hypersurface, i.e., one with $N^2 - N - 1$ degrees of freedom in the
$(N^2 - N)$-dimensional ROC space. (An example in a three-class classification task would be
an observer which sequentially performs a pair of binary classification tasks by first
classifying observations as being “$\pi_1$” or “not $\pi_1$” based on the value of a single
decision parameter, and then further classifying the “not $\pi_1$” observations as “$\pi_2$”
or “$\pi_3$” based on the value of a second decision parameter [17], thus depending on fewer
than the five degrees of freedom needed in a three-class classification task.) Such
“degenerate” observers will not be considered here (apart from the “guessing” observer
itself).
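
The dimension counts quoted in this section are easy to tabulate. The short sketch below does so for a few values of $N$; the helper names are hypothetical.

```python
def roc_space_dimension(n_classes):
    """Number of misclassification probabilities: N^2 - N."""
    return n_classes * n_classes - n_classes


def hypersurface_degrees_of_freedom(n_classes):
    """Degrees of freedom of a true ROC hypersurface: N^2 - N - 1."""
    return roc_space_dimension(n_classes) - 1


def guessing_degrees_of_freedom(n_classes):
    """Degrees of freedom available to the guessing observer: N - 1."""
    return n_classes - 1


for n in (2, 3, 4, 5):
    print(n, roc_space_dimension(n), hypersurface_degrees_of_freedom(n),
          guessing_degrees_of_freedom(n))
```

Only for $N = 2$ do the guessing observer's degrees of freedom match those of a true ROC hypersurface; for three classes the count is two against five, which is why the sequential two-parameter observer mentioned above is also degenerate.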

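We can, thus, define $N$ regions which partition the original data space, given particular
values of the parameters $\vec{\gamma}$, by

\[ \mathcal{D}_1(\vec{\gamma}) = \{\vec{x} : \mathbf{d} = \pi_1 \text{ given } \vec{\gamma}\} \tag{10} \]
\[ \vdots \]
\[ \mathcal{D}_i(\vec{\gamma}) = \{\vec{x} : \mathbf{d} = \pi_i \text{ given } \vec{\gamma}\} \tag{11} \]
\[ \vdots \]
\[ \mathcal{D}_N(\vec{\gamma}) = \{\vec{x} : \mathbf{d} = \pi_N \text{ given } \vec{\gamma}\}. \tag{12} \]

For a nonrandom observer, the $\mathcal{D}_i$ can be expected to depend implicitly on the
pdfs (7)-(9) and, therefore, on the $\delta_j$. The misclassification probabilities which
define the ROC hypersurface are then given by

\[
\begin{bmatrix}
P_{12} \\ P_{13} \\ \vdots \\ P_{1N} \\ \vdots \\ P_{i1} \\ \vdots \\ P_{ij}\ \{i \neq j\} \\ \vdots \\ P_{iN} \\ \vdots \\ P_{N(N-1)} \\ P_{N1}
\end{bmatrix}
=
\begin{bmatrix}
\int_{\mathcal{D}_1} p(\vec{x} \mid \mathbf{t} = \pi_2)\, d^n x \\
\int_{\mathcal{D}_1} p(\vec{x} \mid \mathbf{t} = \pi_3)\, d^n x \\
\vdots \\
\int_{\mathcal{D}_1} p(\vec{x} \mid \mathbf{t} = \pi_N)\, d^n x \\
\vdots \\
\int_{\mathcal{D}_i} p(\vec{x} \mid \mathbf{t} = \pi_1)\, d^n x \\
\vdots \\
\int_{\mathcal{D}_i} p(\vec{x} \mid \mathbf{t} = \pi_j)\, d^n x\ \{i \neq j\} \\
\vdots \\
\int_{\mathcal{D}_i} p(\vec{x} \mid \mathbf{t} = \pi_N)\, d^n x \\
\vdots \\
\int_{\mathcal{D}_N} p(\vec{x} \mid \mathbf{t} = \pi_{N-1})\, d^n x \\
\int_{\mathcal{D}_N} p(\vec{x} \mid \mathbf{t} = \pi_1)\, d^n x
\end{bmatrix}.
\tag{13}
\]

Using (7) and (8), we can rewrite this as

\[
\begin{bmatrix}
P_{12} \\ P_{13} \\ \vdots \\ P_{1N} \\ \vdots \\ P_{i1} \\ \vdots \\ P_{ij}\ \{i \neq j\} \\ \vdots \\ P_{iN} \\ \vdots \\ P_{N(N-1)} \\ P_{N1}
\end{bmatrix}
=
\begin{bmatrix}
P_{1N} + \delta_2 \int_{\mathcal{D}_1} h_2(\vec{x})\, d^n x \\
P_{1N} + \delta_3 \int_{\mathcal{D}_1} h_3(\vec{x})\, d^n x \\
\vdots \\
P_{1N} \\
\vdots \\
P_{iN} + \delta_1 \int_{\mathcal{D}_i} h_1(\vec{x})\, d^n x \\
\vdots \\
P_{iN} + \delta_j \int_{\mathcal{D}_i} h_j(\vec{x})\, d^n x\ \{i \neq j\} \\
\vdots \\
P_{iN} \\
\vdots \\
P_{NN} + \delta_{N-1} \int_{\mathcal{D}_N} h_{N-1}(\vec{x})\, d^n x \\
P_{NN} + \delta_1 \int_{\mathcal{D}_N} h_1(\vec{x})\, d^n x
\end{bmatrix}.
\tag{14}
\]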
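Defining the functions $H_{ij} \equiv \int_{\mathcal{D}_i} h_j(\vec{x})\, d^n x$ allows us to
simplify the notation slightly

\[
\begin{bmatrix}
P_{12} \\ P_{13} \\ \vdots \\ P_{i1} \\ \vdots \\ P_{ij}\ \{i \neq j\} \\ \vdots \\ P_{N(N-1)} \\ P_{N1}
\end{bmatrix}
=
\begin{bmatrix}
P_{1N} + \delta_2 H_{12} \\
P_{1N} + \delta_3 H_{13} \\
\vdots \\
P_{iN} + \delta_1 H_{i1} \\
\vdots \\
P_{iN} + \delta_j H_{ij}\ \{i \neq j\} \\
\vdots \\
P_{NN} + \delta_{N-1} H_{N(N-1)} \\
P_{NN} + \delta_1 H_{N1}
\end{bmatrix}.
\tag{15}
\]

Now of course $P_{NN} = 1 - \sum_{i=1}^{N-1} P_{iN}$; for simplicity, we will write
$\alpha_i \equiv P_{iN}$. Equation (15) can now be written as

\[
\begin{bmatrix}
P_{12} \\ P_{13} \\ \vdots \\ P_{i1} \\ \vdots \\ P_{ij}\ \{i \neq j\} \\ \vdots \\ P_{N(N-1)} \\ P_{N1}
\end{bmatrix}
=
\begin{bmatrix}
\alpha_1 + \delta_2 H_{12} \\
\alpha_1 + \delta_3 H_{13} \\
\vdots \\
\alpha_i + \delta_1 H_{i1} \\
\vdots \\
\alpha_i + \delta_j H_{ij}\ \{i \neq j\} \\
\vdots \\
1 - \sum_{j=1}^{N-1} \alpha_j + \delta_{N-1} H_{N(N-1)} \\
1 - \sum_{i=1}^{N-1} \alpha_i + \delta_1 H_{N1}
\end{bmatrix}
\tag{16}
\]

which further simplifies to

\[
\begin{bmatrix}
P_{12} \\ P_{13} \\ \vdots \\ P_{ij}\ \{i \neq j\} \\ \vdots \\ P_{N1}
\end{bmatrix}
=
\begin{bmatrix}
\alpha_1 \\ \vdots \\ \alpha_1 \\ \vdots \\ \alpha_i \\ \vdots \\ \alpha_i \\ \vdots \\
1 - \sum_{j=1}^{N-1} \alpha_j \\ \vdots \\ 1 - \sum_{j=1}^{N-1} \alpha_j
\end{bmatrix}
+ \sum_{j=1}^{N-1} \delta_j \vec{w}_j
\tag{17}
\]

where each distinct entry of the first vector on the right-hand side is repeated for $N - 1$
elements, and where the vectors $\vec{w}_j$ have components which depend only on $H_{ij}$.
The first term on the right-hand side of this equation is just the expression for the
“guessing” observer [cf. the left-hand side of (6)]. The other term on the right-hand side of
this equation tends to zero as the $\delta_j$ tend to zero. Note that the $H_{ij}$ may in
general depend on the $\delta_k$ via (10)-(12), but

\[
|H_{ij}| = \left| \int_{\mathcal{D}_i} h_j(\vec{x})\, d^n x \right|
\le \int_{\mathcal{D}_i} |h_j(\vec{x})|\, d^n x
\le \int_{\mathcal{D}_i} p(\vec{x} \mid \mathbf{t} = \pi_N)\, d^n x
\le 1. \tag{18}
\]

Thus, the $H_{ij}$ are bounded, and will possess Taylor expansions in $\delta_k$ (i.e., will
not depend on terms of the form $\delta_k^{-m}$ for positive integers $m$). Therefore,
operating points on the ROC hypersurface of a “near-guessing” observer tend continuously
toward points on the ROC hypersurface of the “guessing” observer. Note that the $N(N - 1)$
terms $\alpha_i$, $\delta_j H_{ij}$ are not all independent, since they all depend implicitly
for fixed $\delta_j$ on the $N^2 - N - 1$ decision parameters $\vec{\gamma}$. That is, the
ROC hypersurface given by (17) possesses only $N^2 - N - 1$ degrees of freedom.


IV. The Hypervolume Under the ROC Hypersurface of an N-Class “Near-Guessing” Observer

In the preceding section, it was shown that the ROC hypersurface of a “near-guessing”
observer tends continuously to the ROC hypersurface of a “guessing” observer as the pdfs of
the observational data tend arbitrarily toward identical distributions. Intuitively, one
would expect that the hypervolumes under these hypersurfaces should also tend toward each
other. Since intuition can occasionally be an unreliable guide in analyzing $N$-class
classification tasks, it would be reassuring if the results of the preceding section could be
applied directly to the calculation of the relevant hypervolumes.

For this section, we will write $P_{ij}$ as $p_{ij}(\vec{\gamma})$, emphasizing that it is a
function of the decision parameters chosen. We, thus, rewrite (15) to obtain

\[
\begin{bmatrix}
p_{12}(\vec{\gamma}) \\ p_{13}(\vec{\gamma}) \\ \vdots \\ p_{1N}(\vec{\gamma}) \\ \vdots \\
p_{i1}(\vec{\gamma}) \\ \vdots \\ p_{ij}(\vec{\gamma})\ \{i \neq j\} \\ \vdots \\
p_{iN}(\vec{\gamma}) \\ \vdots \\ p_{N(N-1)}(\vec{\gamma}) \\ p_{N1}(\vec{\gamma})
\end{bmatrix}
=
\begin{bmatrix}
p_{1N}(\vec{\gamma}) + \delta_2 H_{12}(\vec{\gamma}) \\
p_{1N}(\vec{\gamma}) + \delta_3 H_{13}(\vec{\gamma}) \\
\vdots \\
p_{1N}(\vec{\gamma}) \\
\vdots \\
p_{iN}(\vec{\gamma}) + \delta_1 H_{i1}(\vec{\gamma}) \\
\vdots \\
p_{iN}(\vec{\gamma}) + \delta_j H_{ij}(\vec{\gamma})\ \{i \neq j\} \\
\vdots \\
p_{iN}(\vec{\gamma}) \\
\vdots \\
p_{N(N-1)}(\vec{\gamma}) \\
p_{N(N-1)}(\vec{\gamma}) - \delta_{N-1} H_{N(N-1)}(\vec{\gamma}) + \delta_1 H_{N1}(\vec{\gamma})
\end{bmatrix}.
\tag{19}
\]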
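
The decomposition of each misclassification probability into a guessing term plus a small correction can be checked numerically in one dimension. The sketch below is a toy three-class setup under stated assumptions: the reference pdf, the two perturbation functions and the fixed decision thresholds are all invented for illustration, and the decision regions are held fixed rather than adapted to the perturbation sizes.

```python
import math


def p_N(x):
    """Reference pdf for the last class: a standard normal (illustrative choice)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


# Hypothetical perturbations: odd functions times p_N, so each integrates to
# zero and is bounded in magnitude by p_N.
PERTURBATIONS = {
    1: lambda x: math.tanh(x) * p_N(x),
    2: lambda x: -math.tanh(2.0 * x) * p_N(x),
}

# Fixed decision regions for classes 1, 2, 3 on the real line (arbitrary thresholds).
REGIONS = {1: (-12.0, -0.5), 2: (-0.5, 0.7), 3: (0.7, 12.0)}


def integrate(f, lo, hi, steps=20000):
    """Midpoint-rule integral of f over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(f(lo + (k + 0.5) * width) for k in range(steps)) * width


def deviation_from_guessing(delta):
    """Return (largest |P_ij - P_iN|, largest residual of P_ij = P_iN + delta * H_ij)."""
    largest_shift, largest_residual = 0.0, 0.0
    for i, (lo, hi) in REGIONS.items():
        p_iN = integrate(p_N, lo, hi)
        for j, h_j in PERTURBATIONS.items():
            if i == j:
                continue  # P_ii is a sensitivity, not part of the ROC vector
            p_ij = integrate(lambda x: p_N(x) + delta * h_j(x), lo, hi)
            h_ij = integrate(h_j, lo, hi)
            largest_shift = max(largest_shift, abs(p_ij - p_iN))
            largest_residual = max(largest_residual, abs(p_ij - (p_iN + delta * h_ij)))
    return largest_shift, largest_residual


for delta in (0.1, 0.01, 0.0):
    shift, residual = deviation_from_guessing(delta)
    print(f"delta={delta}: largest shift {shift:.3e}, residual {residual:.1e}")
```

With the regions held fixed, the largest departure from the guessing values scales linearly with `delta` and vanishes when `delta` is zero, in line with the continuity argument of the preceding section.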

To find the hypervolume under the ROC surface given by $P_{N1}$ considered as a function of
$(P_{12}, P_{13}, \ldots, P_{ij}, \ldots, P_{N(N-1)}, \ldots, P_{N2})$, one must evaluate the
integral

\[ \int \cdots \int P_{N1}\, d^{N^2 - N - 1}P. \tag{20} \]

(The domain of the integral is simply the set of all $P_{ij}$ such that $P_{N1}$ is defined.)
Note that, for the “guessing” observer, we expect this integral to be zero when $N > 2$ due to
dimensionality considerations: the ROC hypersurface has only $N - 1$ degrees of freedom (cf.
(6)), not the $N^2 - N - 1$ required in this $(N^2 - N)$-dimensional ROC space. To see this
explicitly, one can rearrange the order of integration and consider the innermost integral
$\int P_{N1}\, dP_{N(N-1)}$ for fixed values of the other misclassification probabilities.
Then the limits of integration of this innermost definite integral become, again by (6)

\[
\int_{1 - \sum_j \alpha_j}^{1 - \sum_j \alpha_j} P_{N1}\, dP_{N(N-1)} \quad \{j < N\} \tag{21}
\]

which is zero by inspection.

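We now return to the general case of a “near-guessing” observer. One way to evaluate the
integral in (20) is to reexpress it explicitly in terms of the decision parameters
$\vec{\gamma}$, via the Jacobian

\[
J \equiv
\begin{vmatrix}
\frac{\partial p_{12}}{\partial \gamma_1} & \frac{\partial p_{12}}{\partial \gamma_2} & \frac{\partial p_{12}}{\partial \gamma_3} & \cdots & \frac{\partial p_{12}}{\partial \gamma_{N^2-N-1}} \\
\vdots & & & & \vdots \\
\frac{\partial p_{1N}}{\partial \gamma_1} & \frac{\partial p_{1N}}{\partial \gamma_2} & \frac{\partial p_{1N}}{\partial \gamma_3} & \cdots & \frac{\partial p_{1N}}{\partial \gamma_{N^2-N-1}} \\
\vdots & & & & \vdots \\
\frac{\partial p_{ij}}{\partial \gamma_1} & \frac{\partial p_{ij}}{\partial \gamma_2} & \frac{\partial p_{ij}}{\partial \gamma_3} & \cdots & \frac{\partial p_{ij}}{\partial \gamma_{N^2-N-1}} \\
\vdots & & & & \vdots \\
\frac{\partial p_{N(N-1)}}{\partial \gamma_1} & \frac{\partial p_{N(N-1)}}{\partial \gamma_2} & \frac{\partial p_{N(N-1)}}{\partial \gamma_3} & \cdots & \frac{\partial p_{N(N-1)}}{\partial \gamma_{N^2-N-1}} \\
\vdots & & & & \vdots \\
\frac{\partial p_{N2}}{\partial \gamma_1} & \frac{\partial p_{N2}}{\partial \gamma_2} & \frac{\partial p_{N2}}{\partial \gamma_3} & \cdots & \frac{\partial p_{N2}}{\partial \gamma_{N^2-N-1}}
\end{vmatrix}
\tag{22}
\]

where the vertical bars indicate that the determinant of the enclosed matrix is to be taken,
and where $\gamma_i$ denotes the $i$th component of $\vec{\gamma}$. (We assume that indices of
the parameters $\vec{\gamma}$ have been chosen appropriately so that no negative sign is
introduced, i.e., volumes remain positive.) For the “guessing” observer, this reduces to

\[
J_{\text{guessing}} \equiv
\begin{vmatrix}
\frac{\partial p_{1N}}{\partial \gamma_1} & \frac{\partial p_{1N}}{\partial \gamma_2} & \frac{\partial p_{1N}}{\partial \gamma_3} & \cdots & \frac{\partial p_{1N}}{\partial \gamma_{N^2-N-1}} \\
\vdots & & & & \vdots \\
\frac{\partial p_{1N}}{\partial \gamma_1} & \frac{\partial p_{1N}}{\partial \gamma_2} & \frac{\partial p_{1N}}{\partial \gamma_3} & \cdots & \frac{\partial p_{1N}}{\partial \gamma_{N^2-N-1}} \\
\vdots & & & & \vdots \\
\frac{\partial p_{iN}}{\partial \gamma_1} & \frac{\partial p_{iN}}{\partial \gamma_2} & \frac{\partial p_{iN}}{\partial \gamma_3} & \cdots & \frac{\partial p_{iN}}{\partial \gamma_{N^2-N-1}} \\
\vdots & & & & \vdots \\
\frac{\partial p_{N(N-1)}}{\partial \gamma_1} & \frac{\partial p_{N(N-1)}}{\partial \gamma_2} & \frac{\partial p_{N(N-1)}}{\partial \gamma_3} & \cdots & \frac{\partial p_{N(N-1)}}{\partial \gamma_{N^2-N-1}} \\
\vdots & & & & \vdots \\
\frac{\partial p_{N(N-1)}}{\partial \gamma_1} & \frac{\partial p_{N(N-1)}}{\partial \gamma_2} & \frac{\partial p_{N(N-1)}}{\partial \gamma_3} & \cdots & \frac{\partial p_{N(N-1)}}{\partial \gamma_{N^2-N-1}}
\end{vmatrix}
\tag{23}
\]

where $p_{N(N-1)} = p_{NN} = 1 - \sum_{j=1}^{N-1} p_{jN}$. For a “near-guessing” observer, we
combine (19) and (22) to obtain

\[
J_{\text{near}} \equiv
\begin{vmatrix}
\cdots & \frac{\partial (p_{1N} + \delta_2 H_{12})}{\partial \gamma_k} & \cdots \\
& \vdots & \\
\cdots & \frac{\partial (p_{iN} + \delta_j H_{ij})}{\partial \gamma_k} & \cdots \\
& \vdots & \\
\cdots & \frac{\partial p_{N(N-1)}}{\partial \gamma_k} & \cdots \\
& \vdots & \\
\cdots & \frac{\partial (p_{N(N-1)} - \delta_{N-1} H_{N(N-1)} + \delta_2 H_{N2})}{\partial \gamma_k} & \cdots
\end{vmatrix}.
\tag{24}
\]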

From the properties of determinants [22], it can be shown that, to first order in the
$\delta_j$,

\[ J_{\text{near}} = J_{\text{guessing}} + \sum_{j=1}^{N-1} \delta_j J_j \tag{25} \]

where the $J_j$ are bounded and continuous with respect to the $\delta_j$.
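
The behavior of the Jacobian determinant can be illustrated for three classes, where the matrix is 5 x 5. In the sketch below the gradient vectors are made-up numbers, and the perturbation (a multiple of the identity matrix) is only a stand-in for the actual $\delta_j$-dependent terms; it is meant to show a singular determinant being approached continuously, not to reproduce the specific first-order form of the expansion.

```python
def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [list(map(float, row)) for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-14:
            return 0.0  # singular to working precision
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


# Hypothetical gradients of p_13 and p_23 with respect to five decision parameters.
GRAD_P13 = [1, 2, 0, 1, 0]
GRAD_P23 = [0, 1, 1, 0, 2]
GRAD_P33 = [-(a + b) for a, b in zip(GRAD_P13, GRAD_P23)]  # p_33 = 1 - p_13 - p_23

# Rows for (p_12, p_13, p_21, p_23, p_32). For a guessing observer
# p_12 = p_13, p_21 = p_23 and p_32 = p_33, so rows repeat.
JACOBIAN_GUESSING = [GRAD_P13, GRAD_P13, GRAD_P23, GRAD_P23, GRAD_P33]


def perturbed_jacobian(delta):
    """Determinant after adding delta times the identity (an illustrative perturbation)."""
    return determinant([
        [entry + (delta if r == c else 0.0) for c, entry in enumerate(row)]
        for r, row in enumerate(JACOBIAN_GUESSING)
    ])


j_guessing = determinant(JACOBIAN_GUESSING)
for delta in (1e-1, 1e-2, 1e-3):
    print(delta, perturbed_jacobian(delta))
```

The repeated rows force the guessing determinant to vanish, and the perturbed determinant shrinks smoothly toward zero as `delta` decreases.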

If we denote the hypervolume under the ROC hypersurface of the “guessing” observer by

\[
I_{\text{guessing}} \equiv \int \cdots \int P_{N1}\, d^{N^2-N-1}P
= \int \cdots \int p_{N(N-1)}(\vec{\gamma})\, J_{\text{guessing}}\, d^{N^2-N-1}\gamma \tag{26}
\]

then the hypervolume under the ROC hypersurface of a “near-guessing” observer becomes, again
to first order in the $\delta_j$

\[
I_{\text{near}} = \int \cdots \int
\left[ p_{N(N-1)}(\vec{\gamma}) - \delta_{N-1} H_{N(N-1)}(\vec{\gamma}) + \delta_1 H_{N1}(\vec{\gamma}) \right]
\left[ J_{\text{guessing}} + \sum_{j=1}^{N-1} \delta_j J_j \right] d^{N^2-N-1}\gamma \tag{27}
\]

\[
= I_{\text{guessing}} + \sum_{j=1}^{N-1} \delta_j I_j \tag{28}
\]

where the integrals $I_j$ are bounded (i.e., they may depend on higher integral powers of
$\delta_j$, but not on $\delta_j^{-m}$ for positive integers $m$). That is, in the limit as
the $\delta_j$ tend toward zero, $I_{\text{near}}$ tends toward $I_{\text{guessing}}$ in a
continuous fashion.
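
The step from the product of the two bracketed factors to a sum over the $\delta_j$ can be spelled out as follows. This is a sketch of the algebra under the first-order assumption already stated in the text, suppressing the argument $\vec{\gamma}$; the grouping of terms is ours.

```latex
\begin{align*}
&\bigl[p_{N(N-1)} - \delta_{N-1} H_{N(N-1)} + \delta_1 H_{N1}\bigr]
 \Bigl[J_{\text{guessing}} + \sum_{j=1}^{N-1} \delta_j J_j\Bigr] \\
&\quad = p_{N(N-1)} J_{\text{guessing}}
 + \sum_{j=1}^{N-1} \delta_j\, p_{N(N-1)} J_j
 + \bigl(\delta_1 H_{N1} - \delta_{N-1} H_{N(N-1)}\bigr) J_{\text{guessing}}
 + O(\delta^2).
\end{align*}
```

Integrating the first term over $\vec{\gamma}$ reproduces $I_{\text{guessing}}$, while every remaining term carries at least one factor of some $\delta_j$ multiplying a bounded integrand, which is why the correction vanishes continuously with the $\delta_j$.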


V. The Hypervolume Under the ROC Hypersurface of an N-Class “Near-Perfect” Observer

In the preceding sections, we established that the hypervolume under the ROC hypersurface of
a “guessing” observer is zero, and furthermore that this result is not singular: an observer
in a “near-guessing” task will achieve a ROC hypersurface with hypervolume approaching zero
continuously as the data pdfs approach identity. An ideal observer in a “perfect” task (i.e.,
one in which the data pdfs never overlap) will also achieve a ROC hypersurface with zero
hypervolume, because it can achieve the operating point $\vec{0}$ and, thus, will not, for
any rational decision rule, achieve points interior to the unit hypercube defining ROC space.
It is reasonable to ask whether “near-perfect” observers, performing tasks for which the
overlap in the underlying data pdfs is nearly negligible, behave similarly to “near-guessing”
observers, in the sense that the hypervolume under the ROC hypersurface of such an observer
will approach zero in a continuous fashion.

Consider observational data $\vec{x}$ drawn from $N$ pdfs $p(\vec{x} \mid t = \pi_j)$, where $1 \le j \le N$. We denote the mean of $p(\vec{x} \mid t = \pi_j)$ by $\vec{\mu}_j$ and note that, without loss of generality, the mean of $p(\vec{x} \mid t = \pi_N)$ can be taken to be $\vec{0}$. Furthermore, note that we can apply a linear transformation to the data $\vec{x}$ and, thus, effectively to the $\vec{\mu}_j$, such that each of the resulting $\vec{\mu}_j$ is either 1) mutually orthogonal to, or 2) a scalar multiple of, any of the other $\vec{\mu}_i$. Because the transformation applied is linear, the ideal observer for this task will remain the same, and hence the task itself can be considered essentially unchanged.

Let us consider now an observer for this task which is generally not ideal; in fact, we will consider only a single operating point achieved by this observer. The observer decides $d = \pi_i$ for a given observation $\vec{x}$ if

$$(\vec{x} - \vec{\mu}_i) \cdot \frac{(\vec{\mu}_j - \vec{\mu}_i)}{|\vec{\mu}_j - \vec{\mu}_i|} < \frac{|\vec{\mu}_j - \vec{\mu}_i|}{2} \quad \text{for all } j \ne i, \qquad (29)$$

with equality for any such relation between two classes being decided in an arbitrary but consistent manner. That is, the observer places hyperplanes between the means of any two classes when attempting to decide between those classes (rather than placing those hyperplanes in the likelihood ratio decision variable space, as would the ideal observer).
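The decision rule in (29) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the threshold is taken to be the midpoint between each pair of class means (as suggested by the description of hyperplanes placed between the means), ties are resolved by taking the lowest class index, and the function names `decide` and `_on_near_side` are hypothetical, not taken from the original work.

```python
import math


def _on_near_side(x, mu_i, mu_j):
    """True if x lies on mu_i's side of the hyperplane midway between mu_i and mu_j."""
    diff = [b - a for a, b in zip(mu_i, mu_j)]
    dist = math.sqrt(sum(d * d for d in diff))  # assumes distinct means (dist > 0)
    # Projection of (x - mu_i) onto the unit vector pointing from mu_i to mu_j.
    proj = sum((xk - a) * d for xk, a, d in zip(x, mu_i, diff)) / dist
    # "<=" sends ties to the class tested first: arbitrary but consistent.
    return proj <= dist / 2.0


def decide(x, means):
    """Return the 0-based index of the class chosen for observation x."""
    for i, mu_i in enumerate(means):
        if all(_on_near_side(x, mu_i, mu_j)
               for j, mu_j in enumerate(means) if j != i):
            return i
    raise ValueError("no class satisfied the rule (numerical tie)")


# Hypothetical three-class example: the last mean is the origin and the
# other two means are mutually orthogonal, as in the construction above.
MEANS = [(4.0, 0.0), (0.0, 4.0), (0.0, 0.0)]
```

For instance, `decide((3.5, 0.2), MEANS)` selects the first class, because the observation lies on that mean's side of both relevant hyperplanes.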

Now suppose the task is made slightly “easier,” while the observer itself remains unchanged. That is, consider the mean of one pdf, say $\vec{\mu}_i$ for $i \ne N$, being increased by a factor $1 + \delta$ for $0 < \delta < 1$, while the location of the decision hyperplanes does not change, except in the special case where $\vec{\mu}_j = a\vec{\mu}_i$ for some other pdf (again with $j \ne N$). In this latter case we increase both means ($\vec{\mu}_i' = (1 + \delta)\vec{\mu}_i$, $\vec{\mu}_j' = (1 + \delta)\vec{\mu}_j$), and the location of the corresponding decision hyperplane shifts accordingly.

Note that $\vec{\mu}_i'$ is now further away from each decision hyperplane relevant to $d = \pi_i$ in (29). In the case $\vec{\mu}_j = a\vec{\mu}_i$, the decision hyperplane is now a distance of $|(\vec{\mu}_j' - \vec{\mu}_i')/2| = (1 + \delta)|(\vec{\mu}_j - \vec{\mu}_i)/2|$ from $\vec{\mu}_i'$. For noncollinear $\vec{\mu}_j$, the direction from $\vec{\mu}_i$ to the decision hyperplane is given by $\vec{\mu}_j - \vec{\mu}_i$, and since $\vec{\mu}_i$ and $\vec{\mu}_j$ are orthogonal, $(\vec{\mu}_i' - \vec{\mu}_i) \cdot (\vec{\mu}_j - \vec{\mu}_i) = -\delta|\vec{\mu}_i|^2$; since this quantity is negative, it follows that $\vec{\mu}_i'$ is further from that decision plane than $\vec{\mu}_i$.

It immediately follows from this that none of the misclassification probabilities making up the coordinates of the observer’s operating point can increase when moving from the old task to the new one. To see this, consider a change of coordinates in the data space such that $\vec{\mu}_i'$ is now the origin. All of the decision hyperplanes separating this class from the others are effectively moving away from the center of its pdf; since the hyperplanes are translating without rotating, we see immediately that the probability $P_{ii}$ cannot decrease (and will increase in general), while the other probabilities $P_{ji}$ ($j \ne i$) cannot increase (and will decrease in general).

Note that any pdf $p(\vec{x})$ must decrease more rapidly than $|\vec{x}|^{-n}$ for sufficiently large $|\vec{x}|$, where $n$ is the dimensionality of $\vec{x}$. This allows us to state qualitatively the sense in which the observer under consideration is “near-perfect”: we hypothesize that the $|\vec{\mu}_i|$ are all sufficiently large that this limiting condition is met. Given this condition, the only situation in which an error probability $P_{ji}$ ($j \ne i$) will fail to decrease is if this probability is already zero. By allowing all of the $|\vec{\mu}_i|$ to increase in the manner described above, we can clearly obtain in general a situation in which each of the misclassification probabilities is either decreasing, or equal to zero.

Fig. 1. Operating point of an observer in a two-class classification task with coordinates $(\mathrm{FPF}_0, \mathrm{FNF}_0)$, denoted by the point at the lower left corner of the crosshatched region. Since no rational observer will achieve points in the crosshatched region, the area under this observer’s ROC curve cannot be greater than $1 - (1 - \mathrm{FPF}_0)(1 - \mathrm{FNF}_0)$.

This implies that the hypervolume under the ROC hypersurfaces of the observers under consideration (however we choose to define their decision rules for operating points other than those described above) must also decrease as the task is made “easier” as described above. To see this, note that if a given observer achieves an operating point $\vec{P}$ on its ROC hypersurface, it cannot achieve another point $\vec{P}'$ such that the components of these points satisfy $P_i' > P_i$ ($1 \le i \le N^2 - N$) (because such an observer could be replaced by an observer which achieved $\vec{P}$ for all such points by using the original decision rule for the point $\vec{P}$, thereby achieving unambiguously better performance at those points). Thus, knowing that a given observer achieves an operating point of $\vec{P}$ implies that that observer’s ROC hypersurface must have a hypervolume under it of no greater than $1 - \prod_{i=1}^{N^2-N} (1 - P_i)$; as the (nonzero) $P_i$ decrease, this upper limit on the hypervolume must also decrease to zero. This point is illustrated in Fig. 1 for the two-class case; here the observer’s false-negative fraction, $\mathrm{FNF}_0$, corresponds to $P_{21}$, and the false-positive fraction, $\mathrm{FPF}_0$, corresponds to $P_{12}$.
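The upper limit discussed above can be evaluated numerically. The following is a minimal sketch, assuming the bound has the product form $1 - \prod_i (1 - P_i)$ that generalizes the two-class expression in the caption of Fig. 1; the function name `hypervolume_upper_bound` is hypothetical.

```python
def hypervolume_upper_bound(error_probs):
    """Upper bound on the hypervolume given one known operating point.

    error_probs holds the N^2 - N misclassification probabilities P_i that
    form the coordinates of the operating point. No rational observer reaches
    the region where every coordinate exceeds the known one, so the excluded
    hyper-rectangle has volume prod(1 - P_i).
    """
    excluded = 1.0
    for p in error_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each probability must lie in [0, 1]")
        excluded *= 1.0 - p
    return 1.0 - excluded


# Two-class case from Fig. 1 with illustrative values FPF_0 = 0.2, FNF_0 = 0.1.
two_class_bound = hypervolume_upper_bound([0.2, 0.1])
```

With these illustrative values the bound is $1 - 0.8 \times 0.9 = 0.28$, and it shrinks toward zero as every coordinate of the operating point shrinks.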


To summarize, we have shown that the known operating point of our simple observer will move closer to the origin for arbitrary data pdfs as those pdfs are moved further apart (i.e., as the underlying task is made “easier”), implying that the hypervolume under its ROC hypersurface will also converge to zero. In fact, reasoning as above, one can see that the ideal observer will also be unable to achieve operating points within the region $P_i' > P_i$ ($1 \le i \le N^2 - N$), since the ideal observer’s ROC hypersurface is never above that of any other observer at any given point in the domain of the ROC space [15]. The hypervolume under the ideal observer’s ROC hypersurface will, thus, also converge to zero as the underlying data pdfs are moved apart.

VI.  Conclusion 

In $N$-class classification tasks where $N > 2$, it can be shown that the hypervolumes under the ROC hypersurfaces of both the “guessing” observer and the “perfect” observer are zero. More importantly, we have shown in each of these performance extremes that the convergence to zero is smooth rather than discontinuous. This convergence can be considered completely general for “near-guessing” observers and generally true for “near-perfect” observers which follow rational decision rules (analogous to false-negative fraction and false-positive fraction being monotonically related in a two-class task); that is, the conclusions appear to hold true for arbitrary underlying data pdfs.

In the two-class classification task, the area under the ROC curve (AUC) is considered a useful performance metric for a variety of reasons. One of the most pleasing and straightforward of these is the simple relationship between AUC and the “separability” of the two underlying data pdfs (i.e., the difficulty of the task). Namely, the AUC (with the two-class ROC defined as a plot of false-negative fraction versus false-positive fraction) of a “perfect” observer is zero, and increases in some sense uniformly as the task is made more difficult, until one arrives at the “guessing” observer with an AUC of 0.5. In an $N$-class classification task, this straightforward relationship appears to break down, and both “perfect” and “guessing” observers yield ROC hypersurfaces with zero hypervolume. It would appear that, due to this ambiguity, hypervolume under the ROC hypersurface of an $N$-class observer is not a useful performance metric: Does a hypervolume of 0.005 indicate an observer faced with an exceptionally difficult or exceptionally easy task? One hopes that some other performance metric from two-class classification can be generalized usefully for $N$-class classification; perhaps a quantity which is equal to AUC in the two-class case has a generalization which is not equal to the hypervolume, but can be shown to be of use for other reasons.
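The monotonic two-class relationship described above can be illustrated with a short sketch. It assumes an equal-variance binormal model with separation `d_prime` between the class means; that model is an assumption made here for illustration and is not taken from the text, and the function name `fnf_vs_fpf_area` is hypothetical. Under that assumption, the area under a plot of false-negative fraction versus false-positive fraction is $1 - \Phi(d'/\sqrt{2})$.

```python
import math


def fnf_vs_fpf_area(d_prime):
    """Area under the FNF-versus-FPF curve for an equal-variance binormal task.

    The conventional (TPF-versus-FPF) AUC is Phi(d'/sqrt(2)); plotting FNF
    instead of TPF flips the curve, so the area is 1 - Phi(d'/sqrt(2)),
    which equals 0.5 * erfc(d'/2).
    """
    return 0.5 * math.erfc(d_prime / 2.0)


# Illustrative separations, from "guessing" (0) toward "near-perfect" (large).
areas = [fnf_vs_fpf_area(d) for d in (0.0, 1.0, 2.0, 4.0)]
```

The area starts at 0.5 for zero separation and falls steadily toward zero as the separation grows, which is the unambiguous behavior that the $N$-class hypervolume lacks.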

ABSTRACT

Bayesian  artificial  neural  networks  (BANNs)  have  proven  useful  in  two-class  classification  tasks,  and  are  claimed 
to provide good estimates of ideal-observer-related decision variables (the a posteriori class membership probabilities).
We wish to apply the BANN methodology to three-class classification tasks for computer-aided diagnosis,
but  we  currently  lack  a  fully  general  extension  of  two-class  receiver  operating  characteristic  (ROC)  analysis  to 
objectively  evaluate  three-class  BANN  performance.  It  is  well  known  that  “the  likelihood  ratio  of  the  likelihood 
ratio  is  the  likelihood  ratio.”  Based  on  this,  we  found  that  the  decision  variable  which  is  the  a  posteriori  class 
membership  probability  of  an  observational  data  vector  is  in  fact  equal  to  the  a  posteriori  class  membership 
probability  of  that  decision  variable.  Under  the  assumption  that  a  BANN  can  provide  good  estimates  of  these  a 
posteriori  probabilities,  a  second  BANN  trained  on  the  output  of  such  a  BANN  should  perform  very  similarly  to 
an  identity  function.  We  performed  a  two-class  and  a  three-class  simulation  study  to  test  this  hypothesis.  The 
mean squared error (deviation from an identity function) of a two-class BANN was found to be $2.5 \times 10^{-4}$. The
mean squared error of the first component of the output of a three-class BANN was found to be $2.8 \times 10^{-4}$, and
that of its second component was found to be $3.8 \times 10^{-4}$. Although we currently lack a fully general method to
objectively  evaluate  performance  in  a  three-class  classification  task,  circumstantial  evidence  suggests  that  two- 
and  three-class  BANNs  can  provide  good  estimates  of  ideal-observer-related  decision  variables. 

Keywords:  Bayesian  artificial  neural  networks,  ideal  observers,  three-class  classification 

1.  INTRODUCTION 

In the past, computerized methods for the detection1-5 and classification6-11 of mammographic mass lesions have
been  investigated  at  the  University  of  Chicago.  The  classification  scheme  currently  analyzes  lesions  which  have 
been  manually  identified  by  a  radiologist.  We  are  attempting  to  develop  a  fully  automated  classification  scheme 
by  combining  the  existing  detection  and  classification  schemes;  we  have  argued  previously12  that  this  will  require 
a  three-class  classifier  to  account  for  the  presence  of  false-positive  (FP)  computer  detections,  in  addition  to  the 
malignant  and  benign  lesions,  in  the  output  of  the  detection  scheme. 

For  some  time  now  we  have  explored  the  use  of  Bayesian  artificial  neural  networks  (BANNs)  for  a  variety  of 
detection5, 13, 14  and  classification11  tasks  in  computer-aided  diagnosis  (CAD).  Our  motivation  for  investigating 
BANNs  is  based,  first,  on  our  theoretical  observation  that,  in  the  limit  of  infinite  training  data,  a  BANN  will 
yield  an  ideal  observer  decision  function  for  that  data  population; 15  and  second,  on  empirical  observations 
that  even  given  a  finite  sample  of  training  data,  a  BANN  can  estimate  an  ideal  observer  decision  function 
reasonably  well.16  (We  note  that  the  BANN  implementation  we  are  using  is  that  of  MacKay,17  which  employs  a 
multivariate  normal  function  for  the  prior  distribution  on  the  network  weight  values.)  We  have  also  performed 
simulation  studies  showing  that  BANNs  can  accurately  estimate  ideal  observer  decision  variables  in  a  three-class 
classification  task.15  Moreover,  we  showed  recently  that  a  three-class  BANN  could  produce  decision  variables  for 
actual  mammographic  mass  lesion  feature  data,  and  that  these  decision  variables  are  related  to  two-class  BANN 
decision  variable  data  in  a  particular  way  consistent  with  a  theoretical  relationship  between  three-class  and  two- 
class  ideal  observer  decision  variables.12  We  consider  this  to  be  strong  circumstantial  evidence  for  the  ability 
of  a  BANN  to  estimate  three-class  ideal  observer  decision  variables,  though  we  currently  lack  a  fully  general 
method  for  evaluating  three-class  classifiers  (i.e.,  a  three-class  extension  to  receiver  operating  characteristic 
(ROC)  analysis). 

In this work, we present further circumstantial evidence toward the claim that a BANN can provide good
estimates  of  three-class  ideal  observer  decision  variables.  We  develop  a  theoretical  relationship  between  the 
a  posteriori  class  membership  probabilities  of  a  given  observational  data  variable  and  the  a  posteriori  class 
membership  probabilities  of  those  a  posteriori  probabilities  treated  as  a  set  of  observational  data  in  their  own 
right.  (It  is  known  that  a  posteriori  class  membership  probabilities  are  equivalent  to  ideal  observer  decision 
variables  in  a  two-class  task,16  and  related  in  a  straightforward  way  to  the  ideal  observer  decision  variables  in  a 
task  with  three  or  more  classes.15)  We  then  describe  simulation  studies  to  train  and  test  a  set  of  BANNs,  and 
present  results  of  such  a  simulation  study  verifying  that  the  BANNs  we  examined  did  indeed  obey  the  theoretical 
relationship  predicted  for  ideal  observer  decision  variables,  to  within  experimental  error.  In  the  final  section,  we 
present  our  conclusions  drawn  from  this  work. 


2.  THEORY 

It  is  well  known  that  the  ideal  observer  decision  variable,  i.e.,  the  likelihood  ratio  or  any  monotonic  transformation 
of  this  value,  yields  optimal  performance  in  a  two-class  classification  task. 18  It  can  also  be  shown,  in  a  classification 
task  with  N  classes  (N  >  2),  that  the  ideal  observer  decision  rule  becomes  more  complicated  than  a  simple 
threshold  on  a  single  decision  variable,  but  that  the  optimal  decision  variables  remain  a  set  of  N  —  1  likelihood 
ratios.18, 19 

We can define the $i$th likelihood ratio as

$$\Lambda_i = \mathrm{LR}_i(\vec{x}) = \frac{p(\vec{x} \mid \pi_i)}{p(\vec{x} \mid \pi_N)}, \qquad (1)$$

where $\vec{x}$ represents statistically variable observational data (which we assume to have dimensionality $n$), and $\pi_i$ represents one of the $N$ classes from which the data are drawn (here $1 \le i \le N - 1$). Clearly the vector (of dimensionality $N - 1$) of decision variables $\Lambda_i$ is itself statistically variable, and one might ask what the likelihood ratios of these variables are. In fact,20

*A>->  =  /  ■■/e®*''-  ••<** 

=  /" '/  dx'N ' ' ' dxH ’  (2) 

where  we  have  assumed  that  N  —  1  <  n;  if  N  —  1  =  n,  then  no  integration  is  performed.  (If  N  —  1  >  n,  then 
at  least  one  of  the  likelihood  ratio  decision  variables  will  be  expressible  as  a  function  of  the  others;  we  will  not 
consider this degenerate case here.) The sum is over all solutions to Eq. 1 for a given $\vec{\Lambda}$; this yields

$$\mathrm{LR}_i(\vec{\Lambda}) = \mathrm{LR}_i(\vec{x}_j) = \Lambda_i, \qquad (3)$$

the source of the well-known adage that “the likelihood ratio of the likelihood ratio is the likelihood ratio.”

Consider now a different set of decision variables, the a posteriori class membership probabilities considered as functions of the statistically variable observational data

$$y_i = P(\pi_i \mid \vec{x}). \qquad (4)$$

(Since $P(\pi_N \mid \vec{x}) = 1 - \sum_{i=1}^{N-1} P(\pi_i \mid \vec{x})$, we still have $N - 1$ decision variables.) Note that in a two-class classification task, this decision variable is known to be a monotonic function of the likelihood ratio, and is therefore an ideal observer decision variable;16 while in a classification task with more than two classes, the a posteriori class membership probabilities can be shown to be related to the likelihood ratios in a straightforward way.15

Reasoning as above, we may ask what the a posteriori class membership probability of these decision variables, or $P(\pi_i \mid \vec{y})$, is. In fact, by Bayes’ rule,

$$P(\pi_i \mid \vec{x}) = \frac{\mathrm{LR}_i(\vec{x}) \, P(\pi_i)/P(\pi_N)}{1 + \sum_{k=1}^{N-1} \mathrm{LR}_k(\vec{x}) \, P(\pi_k)/P(\pi_N)}, \qquad (5)$$

and this relation can also be inverted to yield
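A sketch of that inversion, derived here directly from Bayes’ rule rather than copied from the original typeset equation, with $y_k$ denoting $P(\pi_k \mid \vec{x})$ and the priors $P(\pi_k)$ assumed nonzero:

```latex
\begin{align*}
\frac{P(\pi_i \mid \vec{x})}{P(\pi_N \mid \vec{x})}
  &= \frac{p(\vec{x} \mid \pi_i)\, P(\pi_i)}{p(\vec{x} \mid \pi_N)\, P(\pi_N)}
   = \mathrm{LR}_i(\vec{x})\, \frac{P(\pi_i)}{P(\pi_N)}, \\
\mathrm{LR}_i(\vec{x})
  &= \frac{P(\pi_N)}{P(\pi_i)} \cdot
     \frac{y_i}{1 - \sum_{k=1}^{N-1} y_k}.
\end{align*}
```

The second line uses the fact that the $N$ a posteriori probabilities sum to one, so the likelihood ratios and the a posteriori probabilities determine one another whenever $P(\pi_N \mid \vec{x}) > 0$.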

We again start with Eq. 2, this time obtaining

where the sums in $j$ are over all solutions to Eq. 4 for a given $\vec{y}$. (The fraction can be taken out of the integral because the relations in Eqs. 5 and 6 are one-to-one, and thus the set of all solutions to Eq. 4 corresponds to a single value of $\mathrm{LR}_i(\vec{x})$.) This again yields

$$\mathrm{LR}_i(\vec{y}) = \mathrm{LR}_i(\vec{x}_j), \qquad (8)$$

where $\vec{y}$ is the vector of a posteriori class membership probabilities of $\vec{x}$ from Eq. 4, and $\vec{x}_j$ is any solution to that equation for a given $\vec{y}$.

It follows that

$$P(\pi_i \mid \vec{y}) = \frac{\mathrm{LR}_i(\vec{y}) \, P(\pi_i)/P(\pi_N)}{1 + \sum_{k=1}^{N-1} \mathrm{LR}_k(\vec{y}) \, P(\pi_k)/P(\pi_N)} = \frac{\mathrm{LR}_i(\vec{x}_j) \, P(\pi_i)/P(\pi_N)}{1 + \sum_{k=1}^{N-1} \mathrm{LR}_k(\vec{x}_j) \, P(\pi_k)/P(\pi_N)} = P(\pi_i \mid \vec{x}_j) = y_i, \qquad (9)$$

where $\vec{x}_j$ is again any solution to Eq. 4 for a given $\vec{y}$. This shows that a similar adage to that for likelihood ratios holds true, namely that “the a posteriori class probabilities of the (data) a posteriori class probabilities are the (data) a posteriori class probabilities.”
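The adage can be checked exactly on a small discrete example. The sketch below is illustrative only: the priors, the likelihood table, and the function names are hypothetical, and for convenience it tracks all $N$ a posteriori probabilities rather than the $N - 1$ independent ones used in the text. Exact rational arithmetic ensures that observations sharing the same a posteriori vector are grouped together.

```python
from collections import defaultdict
from fractions import Fraction as F


def posteriors(likelihoods, priors):
    """Return {x: tuple of P(class c | x) for every class c}."""
    result = {}
    for x in likelihoods[0]:
        joint = [priors[c] * likelihoods[c][x] for c in range(len(priors))]
        total = sum(joint)
        result[x] = tuple(j / total for j in joint)
    return result


def posteriors_given_y(likelihoods, priors):
    """Return {y: tuple of P(class c | y)}, where y is the posterior vector of x."""
    n_classes = len(priors)
    joint_by_y = defaultdict(lambda: [F(0)] * n_classes)
    for x, y in posteriors(likelihoods, priors).items():
        # Accumulate P(class c, y) over every x that maps to the same y.
        for c in range(n_classes):
            joint_by_y[y][c] += priors[c] * likelihoods[c][x]
    return {y: tuple(j / sum(joint) for j in joint)
            for y, joint in joint_by_y.items()}


# Hypothetical three-class task; observations "a" and "b" share one posterior.
PRIORS = (F(1, 2), F(1, 4), F(1, 4))
LIKELIHOODS = (
    {"a": F(1, 4), "b": F(1, 2), "c": F(1, 8), "d": F(1, 8)},
    {"a": F(1, 8), "b": F(1, 4), "c": F(1, 2), "d": F(1, 8)},
    {"a": F(1, 8), "b": F(1, 4), "c": F(1, 8), "d": F(1, 2)},
)
result = posteriors_given_y(LIKELIHOODS, PRIORS)
```

Every key of `result` equals its value, i.e., the a posteriori probabilities computed from $\vec{y}$ coincide with $\vec{y}$ itself, even though two distinct observations map to the same $\vec{y}$.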


3.  MATERIALS  AND  METHOD 


We  have  shown  in  the  past16  that  a  BANN  can  provide  good  estimates  of  the  a  posteriori  class  membership 
probabilities  in  a  two-class  classification  task,  and  we  have  presented  the  results  of  simulation  studies15  and 
experiments  with  real  mammographic  feature  data12  strongly  suggesting  that  the  same  holds  true  for  three-class 
BANNs  as  well.  The  theoretical  relationship  given  by  Eq.  9,  derived  in  the  preceding  section,  provides  a  basis 
for  another  simulation  study  which  should  provide  further  circumstantial  evidence  for  the  claim  that  two-class 
and  three-class  BANNs  can  provide  good  estimates  of  the  two-  and  three-class  a  posteriori  class  membership 
probabilities  (directly  related  to  the  ideal  observer  decision  variables  via  Eq.  5),  respectively. 

Specifically,  for  the  two-class  simulation  study,  we  drew  500  samples  pseudorandomly  from  each  of  two 
distributions: 


$$p(x \mid \pi_1) = N(x; \mu_1 = 1, \sigma_1^2 = 2) \qquad (10)$$

$$p(x \mid \pi_2) = N(x; \mu_2 = 0, \sigma_2^2 = 1). \qquad (11)$$

We then trained a two-class BANN with one input, five hidden units, and one output on this data, obtaining a classifier we denote by

$$y = B_1^2(x). \qquad (12)$$

(The superscript denotes the number of classes being classified.) We then used this output, given the known truth states for the original observations $x$ from which it was obtained, as training data for a second BANN with one input, five hidden units, and one output:

$$z = B_2^2(y). \qquad (13)$$

Finally, we pseudorandomly sampled an independent testing set of 500 observations $x$ from each of the two classes given in Eqs. 10 and 11. This testing set was used as input to the first BANN to obtain a testing set $y^{\text{test}}$; this in turn was given as input to the second BANN, for which the output was $z^{\text{test}}$.

Given Eq. 9, together with the assumption that an adequately trained two-class BANN yields good estimates of the a posteriori class membership probabilities of the observations being classified, it should be the case that $z^{\text{test}}$ estimates $y^{\text{test}}$ at least to within experimental error. To verify this, we plotted $z^{\text{test}}$ as a function of $y^{\text{test}}$ for each of the two classes, and we computed the mean squared error

$$\mathrm{MSE}_2 = \frac{1}{1000} \sum \left( z^{\text{test}} - y^{\text{test}} \right)^2, \qquad (14)$$

where the sum is over all the observations in the two classes.
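The two-class test procedure can be outlined in code. This is a sketch under several assumptions, not the authors' implementation: no BANN is trained, the exact a posteriori probability stands in for the first-stage output, equal prior probabilities are assumed, the class parameters follow the reading of Eqs. 10 and 11 as (mean 1, variance 2) and (mean 0, variance 1), and all function names are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution with the given variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def posterior_class1(x, prior1=0.5):
    """Exact P(pi_1 | x) for the two assumed class densities (equal priors assumed)."""
    joint1 = prior1 * normal_pdf(x, 1.0, 2.0)
    joint2 = (1.0 - prior1) * normal_pdf(x, 0.0, 1.0)
    return joint1 / (joint1 + joint2)


def mean_squared_error(z, y):
    """Mean squared deviation of z from y (zero for an identity mapping)."""
    if len(z) != len(y) or not y:
        raise ValueError("z and y must be non-empty and of equal length")
    return sum((zi - yi) ** 2 for zi, yi in zip(z, y)) / len(y)


def sample_test_set(n_per_class=500, seed=0):
    """Draw n_per_class pseudorandom observations from each of the two classes."""
    rng = random.Random(seed)
    class1 = [rng.gauss(1.0, math.sqrt(2.0)) for _ in range(n_per_class)]
    class2 = [rng.gauss(0.0, 1.0) for _ in range(n_per_class)]
    return class1 + class2


x_test = sample_test_set()
y_test = [posterior_class1(x) for x in x_test]  # stand-in for the first BANN
z_test = list(y_test)                           # stand-in for the second BANN
mse2 = mean_squared_error(z_test, y_test)
```

In a real experiment `z_test` would be the output of the trained second-stage network, and `mse2` would then measure how far that network departs from an identity function; with the stand-in used here it is exactly zero.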

Similarly,  for  the  three-class  simulation  study,  we  drew  500  two-dimensional  samples  pseudorandomly  from 
each  of  three  distributions: 

We then trained a three-class BANN with two inputs, five hidden units, and two outputs on this data, obtaining a classifier we denote by

$$\vec{y} = B_1^3(\vec{x}). \qquad (18)$$

We then used this output, given the known truth states for the original observations $\vec{x}$ from which it was obtained, as training data for a second BANN with two inputs, five hidden units, and two outputs:

$$\vec{z} = B_2^3(\vec{y}). \qquad (19)$$


Figure 1. Output of the second two-class BANN as a function of its input for the observations actually drawn from class $\pi_1$ in the two-class simulation study.

Finally, we pseudorandomly sampled an independent testing set of 500 observations $\vec{x}$ from each of the three classes given in Eqs. 15-17. This testing set was used as input to the first BANN to obtain a testing set $\vec{y}^{\text{test}}$; this in turn was given as input to the second BANN, for which the output was $\vec{z}^{\text{test}}$.

Again, given Eq. 9, together with the assumption that an adequately trained three-class BANN yields good estimates of the a posteriori class membership probabilities of the observations being classified, it should be the case that $z_1^{\text{test}}$ estimates $y_1^{\text{test}}$, and $z_2^{\text{test}}$ estimates $y_2^{\text{test}}$, at least to within experimental error. To verify this, we plotted $z_1^{\text{test}}$ as a function of $y_1^{\text{test}}$, and $z_2^{\text{test}}$ as a function of $y_2^{\text{test}}$, for each of the three classes, and we computed the mean squared errors

$$\mathrm{MSE}_{3i} = \frac{1}{1500} \sum \left( z_i^{\text{test}} - y_i^{\text{test}} \right)^2, \qquad (20)$$

$\{i : 1, 2\}$, where the sum is over all the observations in the three classes.

4.  RESULTS 

Figure 1 shows $z^{\text{test}}$ as a function of $y^{\text{test}}$ for the observations in class $\pi_1$, and Fig. 2 shows $z^{\text{test}}$ as a function of $y^{\text{test}}$ for the observations in class $\pi_2$ from the two-class simulation study. The mean squared error for the complete set of 1000 observations was $2.5 \times 10^{-4}$.

Figure 3 shows the components of $\vec{z}^{\text{test}}$ as a function of the corresponding components of $\vec{y}^{\text{test}}$ for the observations in class $\pi_1$. Similarly, Fig. 4 shows the components of $\vec{z}^{\text{test}}$ as a function of the corresponding components of $\vec{y}^{\text{test}}$ for the observations in class $\pi_2$, and Fig. 5 shows the components of $\vec{z}^{\text{test}}$ as a function of the corresponding components of $\vec{y}^{\text{test}}$ for the observations in class $\pi_3$. The mean squared error for the complete set of 1500 observations was $2.8 \times 10^{-4}$ for the first component and $3.8 \times 10^{-4}$ for the second component.

5.  DISCUSSION  AND  CONCLUSIONS 

We  developed  a  theoretical  relationship  between  the  a  posteriori  class  membership  probabilities,  directly  related 
to  ideal  observer  decision  variables,  and  the  a  posteriori  class  membership  probabilities  of  those  a  posteriori 
class  membership  probabilities  treated  as  statistically  variable  observer  data  in  their  own  right.  The  identity 
relationship  found  is,  perhaps  unsurprisingly,  quite  similar  in  spirit  to  the  identity  relationship  between  the 
likelihood  ratio  decision  variables  and  the  likelihood  ratio  of  those  likelihood  ratio  decision  variables  for  a  given 
task. 


Figure 2. Output of the second two-class BANN as a function of its input for the observations actually drawn from class $\pi_2$ in the two-class simulation study.

Figure 3. The (a) first and (b) second components of the output of the second three-class BANN as a function of the corresponding component of its input for the observations actually drawn from class $\pi_1$ in the three-class simulation study.

Figure 4. The (a) first and (b) second components of the output of the second three-class BANN as a function of the corresponding component of its input for the observations actually drawn from class $\pi_2$ in the three-class simulation study.

Figure 5. The (a) first and (b) second components of the output of the second three-class BANN as a function of the corresponding component of its input for the observations actually drawn from class $\pi_3$ in the three-class simulation study.


We currently lack a fully general method for three-class classification or for practically evaluating the performance
of a three-class classifier. As a first step toward such a classification method, we are investigating the use
of  BANNs  to  estimate  three-class  ideal  observer  decision  variables  for  such  a  task.  Since,  in  a  practical  situation, 
we  will  not  have  access  to  the  underlying  probability  distributions  from  which  the  observational  data  are  drawn, 
we  must  rely  on  circumstantial  evidence  in  support  of  our  claim  that  a  three-class  BANN  can  adequately  estimate 
decision  variables  directly  related  to  ideal  observer  decision  variables. 

Previously,  we  presented  work  relating  the  output  of  a  three-class  BANN  to  the  outputs  of  two-class  BANNs 
trained  for  various  “simplified”  cases  in  which  the  three-class  classification  task  was  reduced  to  a  two-class 
classification  task,  and  showed  that  the  relationships  found  were  consistent  with  the  relationship  between  three- 
and  two-class  ideal  observers  for  the  same  tasks.12  In  the  present  work,  we  showed  that  the  output  of  two-  and 
three-class  BANNs  was  consistent,  to  within  experimental  error,  with  the  theoretical  relationship  developed  for 
actual  a  posteriori  class  membership  probabilities.  This  is  of  limited  practical  use  in  the  complete  development  of 
a  three-class  classifier,  mainly  because  the  three-class  ideal  observer  decision  rule  is  considerably  more  complicated 
than  its  two-class  counterpart  (a  simple  threshold  on  a  single  decision  variable).  It  does,  however,  bolster  our 
confidence  in  the  choice  of  the  BANN  as  an  appropriate  tool  for  estimating  the  decision  variables  which  would 
eventually  be  incorporated  in  such  a  classifier. 

ABSTRACT

We  analyzed  a  variety  of  recently  proposed  decision  rules  for  three-class  classification  from  the  point  of  view  of 
ideal  observer  decision  theory.  We  considered  three-class  decision  rules  which  have  been  proposed  recently:  one 
by Scurfield, one by Chan et al., and one by Mossman. Scurfield’s decision rule can be shown to be a special
case  of  the  three-class  ideal  observer  decision  rule  in  two  different  situations:  when  the  pair  of  decision  variables 
is  the  pair  of  likelihood  ratios  used  by  the  ideal  observer,  and  when  the  pair  of  decision  variables  is  the  pair  of 
logarithms of the likelihood ratios. Chan et al. start with an ideal observer model, where two of the decision
lines  used  by  the  ideal  observer  overlap,  and  the  third  line  becomes  undefined.  Finally,  we  showed  that  the 
Mossman  decision  rule  (in  which  a  single  decision  line  separates  one  class  from  the  other  two,  while  a  second  line 
separates  those  two  classes)  cannot  be  a  special  case  of  the  ideal  observer  decision  rule.  Despite  the  considerable 
difficulties  presented  by  the  three-class  classification  task  compared  with  two-class  classification,  we  found  that 
the  three-class  ideal  observer  provides  a  useful  framework  for  analyzing  a  wide  variety  of  three-class  decision 
strategies. 

Keywords:  ROC  analysis,  three-class  classification,  ideal  observer  decision  rules 

1.  INTRODUCTION 

We  are  attempting  to  develop  a  fully  automated  mass  lesion  classification  scheme  for  computer-aided  diagnosis 
(CAD)  in  mammography.  This  scheme  will  combine  two  schemes  developed  at  the  University  of  Chicago:  one  for 
automatically  detecting  mass  lesions  in  mammograms, 1-5  and  one  for  classifying  known  lesions  as  malignant  or 
benign. 6-10  Combining  these  two  types  of  CAD  scheme  is  inherently  difficult,  because  the  output  of  the  detection 
scheme  will  necessarily  include  false-positive  (FP)  computer  detections  in  addition  to  the  malignant  and  benign 
lesions  to  be  classified.  These  FP  computer  detections  correspond  to  objects  which  were  by  design  not  included 
in  the  training  sample  of  the  classification  scheme,  because  they  are  not  members  of  the  data  population  (benign 
and  malignant  mass  breast  lesions)  for  which  the  classification  scheme  was  created.  It  is  clear  then  that  the 
detection  scheme’s  output  cannot  be  used  unmodified  as  the  input  to  the  classification  scheme. 

Our  approach  has  been  to  treat  this  problem  explicitly  as  a  three-class  classification  task.  That  is,  the 
outputs of the detection scheme should be classified as malignant lesions, benign lesions, and non-lesions (FP
computer  detections),  and  the  classifier  to  be  estimated  is  the  ideal  observer  decision  rule  for  this  task.  Such 
an  approach  presents  considerable  difficulties  of  its  own.  On  the  one  hand,  decision  rules,  in  particular  ideal 
observer  decision  rules,  increase  rapidly  in  complexity  with  the  number  of  classes  involved.  On  the  other  hand,  a 
fully  general  performance  evaluation  method,  such  as  a  three-class  extension  of  receiver  operating  characteristic 
(ROC)  analysis,  has  yet  to  be  developed. 

The  explicit  form  of  the  ideal  observer  in  a  three-class  classification  task  has  been  known  for  some  time.11 
For  the  reasons  just  stated,  however,  a  practical  method  for  estimating  and  evaluating  observer  performance 
based  on  an  ideal  observer  model  has  proven  elusive,  despite  the  success  of  the  two-class  binormal  ideal  observer 
model.12  Nevertheless,  pragmatic  observer  decision  rule  models  for  three-class  classification  tasks  have  been 
proposed  relatively  recently  by  several  groups  of  researchers.  In  some  cases,  these  models  are  motivated  more 
by  considerations  of  tractability  than  of  complete  generality.  This  is  of  course  understandable  given  the  inherent 
difficulties  of  three-class  classification;  however,  we  thought  it  might  be  of  interest  to  analyze  a  number  of  recently 
proposed three-class decision rule models within an ideal observer decision rule framework.

In the next section, we review the three-class ideal observer decision rule. In the following three sections, we
review recently proposed three-class decision rule models: one by Scurfield,13 one by Chan et al.,14 and one by
Mossman.15  In  each  case,  the  given  decision  rule  is  analyzed  in  terms  of  the  ideal  observer  decision  rule;  where 
necessary  or  expedient,  assumptions  are  made  about  the  observer’s  decision  variables  in  order  to  facilitate  this 
analysis.  We  emphasize  that  we  do  not  attempt  a  review  of  the  experimental  methods  in  the  works  discussed; 
we  are  specifically  interested  only  in  the  form  of  the  decision  rule  which  serves  as  the  starting  point  for  each 
work.  The  results  of  our  analyses  are  briefly  summarized  in  Sec.  6. 

2.  THE  THREE-CLASS  IDEAL  OBSERVER 

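It can be shown11,16 that an $N$-class ideal observer makes decisions regarding statistically variable observations
$\mathbf{x}$ by partitioning a likelihood ratio decision variable space, where the boundaries of the partitions are given by
hyperplanes:

\[ \text{decide } \mathbf{d} = \pi_i \text{ iff } \sum_{k=1}^{N-1} (U_{i|k} - U_{j|k})\, P(\mathbf{t} = \pi_k)\, \mathrm{LR}_k \ge (U_{j|N} - U_{i|N})\, P(\mathbf{t} = \pi_N) \quad \{j < i\} \tag{1} \]

and

\[ \sum_{k=1}^{N-1} (U_{i|k} - U_{j|k})\, P(\mathbf{t} = \pi_k)\, \mathrm{LR}_k > (U_{j|N} - U_{i|N})\, P(\mathbf{t} = \pi_N) \quad \{j > i\}. \tag{2} \]

Here $U_{i|j}$ is the utility of deciding an observation is from class $\pi_i$ given that it is actually from class $\pi_j$, and the
$N - 1$ likelihood ratios are defined as

\[ \mathrm{LR}_i \equiv \frac{p_{\mathbf{x}}(\mathbf{x} \mid \mathbf{t} = \pi_i)}{p_{\mathbf{x}}(\mathbf{x} \mid \mathbf{t} = \pi_N)} \tag{3} \]

for $i < N$. We also define the actual class (the “truth”) to which an observation belongs as $\mathbf{t}$, and the class to
which it is assigned (the “decision”) as $\mathbf{d}$, where $\mathbf{t}$ and $\mathbf{d}$ can take on any of the values $\pi_1, \ldots, \pi_i, \ldots, \pi_N$, the
labels of the various classes. (We use boldface type to denote statistically variable quantities.)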

The partitioning of the decision variable space is determined by the parameters

\[ \gamma_{ijk} \equiv (U_{i|k} - U_{j|k})\, P(\mathbf{t} = \pi_k), \tag{4} \]

with $i$, $j$, and $k$ varying from 1 to $N$, and $j \ne i$. Note that these parameters are not independent, however,
because

\[ \gamma_{ijk} = \gamma_{kjk} - \gamma_{kik}. \tag{5} \]

We can impose the reasonable condition that the utility for correctly classifying an observation from a given
class should be greater than any utility for incorrectly classifying an observation from the same class, i.e.,
$U_{i|i} > U_{j|i}$. This gives, for $j \ne i$,

\[ \gamma_{iji} > 0, \tag{6} \]

leaving $N(N - 1)$ parameters (the rest are derivable from Eq. 5).

Finally, note that the hyperplanes represented by Eqs. 1 and 2 are unchanged if we multiply all of these
equations by a single scalar, such as $1 / \sum_{i} \sum_{j \ne i} \gamma_{iji}$. This leaves us with $N^2 - N - 1$ degrees of freedom, as
expected.
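
As a rough numerical illustration of Eqs. 4-6, the following Python sketch computes the $\gamma_{ijk}$ from a utility matrix and a set of a priori probabilities, and checks the dependence relation of Eq. 5. The function name `gamma_params`, the zero-based array layout, and the example utilities and priors are assumptions made only for this illustration; they are not taken from any of the works reviewed.

```python
import numpy as np

def gamma_params(U, priors):
    """Return g[i, j, k] = (U[i, k] - U[j, k]) * priors[k], as in Eq. 4.

    U[i, k] is the utility of deciding class i when the truth is class k
    (zero-based indices; this array layout is an assumption of the sketch).
    """
    U = np.asarray(U, dtype=float)
    priors = np.asarray(priors, dtype=float)
    return (U[:, None, :] - U[None, :, :]) * priors[None, None, :]

# Hypothetical utilities: 1 for a correct decision, 0 for any error.
U = np.eye(3)
priors = np.array([0.2, 0.3, 0.5])
g = gamma_params(U, priors)

N = len(priors)
# Eq. 5: gamma_ijk = gamma_kjk - gamma_kik for every i, j, k.
for i in range(N):
    for j in range(N):
        for k in range(N):
            assert np.isclose(g[i, j, k], g[k, j, k] - g[k, i, k])

# Eq. 6: gamma_iji > 0 for j != i when a correct decision has the highest utility.
gamma_iji = {(i, j): g[i, j, i] for i in range(N) for j in range(N) if j != i}
assert all(v > 0 for v in gamma_iji.values())

# N(N - 1) parameters, one of which is absorbed by the overall scale factor.
degrees_of_freedom = len(gamma_iji) - 1   # N^2 - N - 1
```

For $N = 3$ this leaves five degrees of freedom, which is the count used in the remainder of the analysis.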

The behavior of a three-class ideal observer is completely determined by the three decision boundary lines

\[ \gamma_{121} \mathrm{LR}_1 - \gamma_{212} \mathrm{LR}_2 = \gamma_{313} - \gamma_{323} \tag{7} \]

\[ \gamma_{131} \mathrm{LR}_1 + (\gamma_{232} - \gamma_{212}) \mathrm{LR}_2 = \gamma_{313} \tag{8} \]

\[ (\gamma_{131} - \gamma_{121}) \mathrm{LR}_1 + \gamma_{232} \mathrm{LR}_2 = \gamma_{323}, \tag{9} \]

which we call, respectively, the “1-vs.-2” line, the “1-vs.-3” line, and the “2-vs.-3” line. Note that if any two
of these lines intersect, the third line must also share this intersection point. We also emphasize the simple
interpretation, from Eq. 4, of each of the $\gamma_{iji}$ parameters appearing in these decision boundary line equations
as the difference in utilities between a “correct” and one particular “incorrect” decision (scaled by the a priori
probability of the true class in question); and of each difference in the $\gamma_{iji}$ parameters as a difference in utilities
between two possible “incorrect” decisions (again scaled by the a priori probability of the true class in question).

Figure 1. Example three-class ideal observer decision rule, given the values of the decision parameters $\gamma_{121} = \gamma_{212} = 3/14$
and $\gamma_{131} = \gamma_{313} = \gamma_{232} = \gamma_{323} = 1/7$. Note $\gamma_{iji} \equiv (U_{i|i} - U_{j|i})\, P(\mathbf{t} = \pi_i)$.


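An example ideal observer decision rule for particular values of the utilities $U_{i|j}$, and hence of the parameters
$\gamma_{iji}$, is shown in Fig. 1. Here we have chosen $\gamma_{121} = \gamma_{212} = 3/14$ and $\gamma_{131} = \gamma_{313} = \gamma_{232} = \gamma_{323} = 1/7$, yielding
the decision boundary lines

\[ \tfrac{3}{14} \mathrm{LR}_1 - \tfrac{3}{14} \mathrm{LR}_2 = 0 \quad \{\text{“1-vs.-2”}\} \tag{10} \]

\[ \tfrac{1}{7} \mathrm{LR}_1 - \tfrac{1}{14} \mathrm{LR}_2 = \tfrac{1}{7} \quad \{\text{“1-vs.-3”}\} \tag{11} \]

\[ -\tfrac{1}{14} \mathrm{LR}_1 + \tfrac{1}{7} \mathrm{LR}_2 = \tfrac{1}{7} \quad \{\text{“2-vs.-3”}\}. \tag{12} \]

These simplify to the equations $\mathrm{LR}_2 = \mathrm{LR}_1$, $\mathrm{LR}_2 = 2\mathrm{LR}_1 - 2$, and $\mathrm{LR}_2 = \mathrm{LR}_1/2 + 1$, respectively.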
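
The example above can be checked with a few lines of exact arithmetic. The sketch below evaluates the left side minus the right side of Eqs. 7-9 for the example parameters and turns the three signs into a decision. The names `margins` and `decide`, the test points, and the convention that a tie goes to the higher class index are assumptions of this sketch rather than part of the original analysis.

```python
from fractions import Fraction as F

# Example parameters from the text (gamma_iji, keyed by the index string "iji").
g = {"121": F(3, 14), "212": F(3, 14),
     "131": F(1, 7), "313": F(1, 7),
     "232": F(1, 7), "323": F(1, 7)}

def margins(lr1, lr2):
    """Left side minus right side of Eqs. 7-9 at the point (LR1, LR2)."""
    m12 = g["121"] * lr1 - g["212"] * lr2 - (g["313"] - g["323"])
    m13 = g["131"] * lr1 + (g["232"] - g["212"]) * lr2 - g["313"]
    m23 = (g["131"] - g["121"]) * lr1 + g["232"] * lr2 - g["323"]
    return m12, m13, m23

def decide(lr1, lr2):
    """Return 1, 2, or 3; a tie goes to the higher class index (an assumption)."""
    m12, m13, m23 = margins(lr1, lr2)
    if m12 > 0 and m13 > 0:
        return 1
    if m12 <= 0 and m23 > 0:
        return 2
    return 3

# The three lines share a single intersection point, here (LR1, LR2) = (2, 2).
assert margins(F(2), F(2)) == (0, 0, 0)

print(decide(4, 1), decide(1, 4), decide(F(1, 2), F(1, 2)))  # prints "1 2 3"
```

A positive margin means that the lower-numbered class of the pair has the larger expected utility, so only two of the three comparisons are ever needed to reach a decision.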

3.  THE  SCURFIELD  DECISION  RULE 

Scurfield investigated a decision rule applied to two-dimensional statistically variable data ($\mathbf{y} = (\mathbf{y}_1, \mathbf{y}_2)$) drawn
from three classes.13 The application domain was human observer performance modeling for acoustical
psychophysics experiments. (In prior work, Scurfield investigated a decision rule for three-class classification of
univariate data.17 We will not review that prior work here, because at present we are interested in relating given
observer models to the three-class ideal observer model for multivariate observational data, which yield
two-dimensional decision variable data by Eq. 3.) In Scurfield’s work, no assumptions are made about the decision
variables $\mathbf{y}_1$ and $\mathbf{y}_2$; in particular, these decision variables are not assumed to be related in any way to an ideal
observer model. This is entirely appropriate given the nature of the problem domain Scurfield investigated,
i.e., human observer performance modeling. It can readily be shown, however, that if one chooses to make such
assumptions, special cases of the Scurfield model are in fact special cases of an ideal observer decision rule.


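Figure 2. Decision rule investigated by Scurfield, for the decision parameters $\gamma_1$ and $\gamma_2$.


The Scurfield decision rule is dependent on two decision parameters, which we will call $\gamma_1$ and $\gamma_2$. The
decision rule can be written as

\[ \text{decide } \mathbf{d} = \pi_1 \text{ iff } \mathbf{y}_1 - \mathbf{y}_2 \ge \gamma_1 - \gamma_2 \text{ and } \mathbf{y}_1 \ge \gamma_1; \tag{13} \]

\[ \text{decide } \mathbf{d} = \pi_2 \text{ iff } \mathbf{y}_1 - \mathbf{y}_2 < \gamma_1 - \gamma_2 \text{ and } \mathbf{y}_2 \ge \gamma_2; \tag{14} \]

\[ \text{decide } \mathbf{d} = \pi_3 \text{ iff } \mathbf{y}_1 < \gamma_1 \text{ and } \mathbf{y}_2 < \gamma_2. \tag{15} \]

This decision rule is illustrated in Fig. 2.

From these relations, one can define the decision boundary lines

\[ \mathbf{y}_1 - \mathbf{y}_2 = \gamma_1 - \gamma_2 \quad \{\text{“1-vs.-2”}\} \tag{16} \]

\[ \mathbf{y}_1 = \gamma_1 \quad \{\text{“1-vs.-3”}\} \tag{17} \]

\[ \mathbf{y}_2 = \gamma_2 \quad \{\text{“2-vs.-3”}\}. \tag{18} \]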

Note the similarity in form between these equations and Eqs. 7-9. If we choose $\mathbf{y}_1 = \mathrm{LR}_1(\mathbf{x})$ and $\mathbf{y}_2 = \mathrm{LR}_2(\mathbf{x})$
for some set of observational data $\mathbf{x}$, we have a special case of Eqs. 7-9, which is illustrated in Fig. 3.

A second correspondence between Scurfield’s decision rule and the ideal observer decision rule can be obtained
by taking $\mathbf{y}_1 = \log(\mathrm{LR}_1(\mathbf{x}))$ and $\mathbf{y}_2 = \log(\mathrm{LR}_2(\mathbf{x}))$; note that a line of the form $\log(\mathrm{LR}_2) = \log(\mathrm{LR}_1) + \alpha$
corresponds to a line of the form $\mathrm{LR}_2 = \beta \mathrm{LR}_1$ for appropriate constants $\alpha$ and $\beta$. By inspection, this is again a
special case of Eqs. 7-9, which is illustrated in Fig. 4.

Scurfield points out13 that the observer which maximizes $P_C$, the “percent correct” or probability of a
correct response, is a special case of the ideal observer (i.e., a single operating point achievable by the ideal
observer for the given task). This observer follows the Scurfield decision rule model with $\mathbf{y}_1 = \log(\mathrm{LR}_1(\mathbf{x}))$ and
$\mathbf{y}_2 = \log(\mathrm{LR}_2(\mathbf{x}))$, and decision parameters given by $e^{\gamma_1} = P(\pi_3)/P(\pi_1)$ and $e^{\gamma_2} = P(\pi_3)/P(\pi_2)$. It is interesting
to note that the Scurfield decision rule model can in fact be used to describe ideal observer performance for an
even wider class of operating points, as shown in this section.
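
The correspondence with the observer that maximizes $P_C$ can be illustrated numerically. In the sketch below, the Scurfield rule of Eqs. 13-15 is applied to log likelihood ratios with $\gamma_1 = \log(P(\pi_3)/P(\pi_1))$ and $\gamma_2 = \log(P(\pi_3)/P(\pi_2))$, and compared with a rule that simply picks the class with the largest a posteriori probability. The function names, the priors, and the randomly drawn likelihood ratio values are all assumptions made for this illustration.

```python
import math
import random

def scurfield_decide(y1, y2, g1, g2):
    """Scurfield rule (Eqs. 13-15) applied to decision variables y1 and y2."""
    if y1 - y2 >= g1 - g2 and y1 >= g1:
        return 1
    if y1 - y2 < g1 - g2 and y2 >= g2:
        return 2
    return 3

def max_posterior_decide(lr1, lr2, priors):
    """Pick the class with the largest a posteriori probability."""
    p1, p2, p3 = priors
    scores = [p1 * lr1, p2 * lr2, p3]   # proportional to the posteriors
    return 1 + scores.index(max(scores))

priors = (0.2, 0.3, 0.5)                 # hypothetical a priori probabilities
g1 = math.log(priors[2] / priors[0])
g2 = math.log(priors[2] / priors[1])

rng = random.Random(0)
agree = 0
for _ in range(1000):
    lr1, lr2 = rng.uniform(0.01, 10.0), rng.uniform(0.01, 10.0)
    d_scurfield = scurfield_decide(math.log(lr1), math.log(lr2), g1, g2)
    agree += (d_scurfield == max_posterior_decide(lr1, lr2, priors))
```

Away from the decision boundaries the two rules should agree on every sampled point, which is the sense in which the maximum-$P_C$ observer is one operating point of the Scurfield model.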

4.  THE  CHAN  DECISION  RULE 

Chan et al. are investigating three-class classifiers for computer-aided diagnosis.14 Their work is motivated by
reasoning similar in principle to that which we independently arrived at when we began to consider this problem.
In particular, they consider a clinical situation in which observations must be classified as malignant, benign,
or normal. Because the goal of their work is to optimize the performance of a system to aid a radiologist or
clinician, rather than to measure the psychophysical performance of an existing observer, they choose to start
explicitly from an ideal observer model in constructing their decision rule.


Figure 5. The decision rule investigated by Chan et al., which as they state is a special case of the ideal observer decision
rule. Observations in the unlabelled region are decided “not $\pi_3$”, i.e., either “$\pi_1$” or “$\pi_2$”.


In order to reduce the complexity of the ideal observer decision rule to manageable proportions, Chan et al.
impose restrictions on the utilities used by their observer. In their formulation, the class we are labelling $\pi_1$ is
the benign class; $\pi_2$, the normal class; and the malignant class is $\pi_3$. They further assume that the possible
values of any utility $U_{i|j}$ are restricted to the interval $[0, 1]$. They then set $U_{1|1} = U_{2|2} = U_{3|3} = 1$ (i.e., correctly
identifying any case has maximal utility). Furthermore, they require $U_{2|1} = U_{1|2} = 1$ and $U_{1|3} = U_{2|3} = 0$
(i.e., misidentifying a benign case as normal, or vice versa, has no significant cost reducing the utility of such a
decision from the maximum, but misclassifying an actually malignant case as benign or normal has the minimum
possible utility). Finally, $U_{3|1}$ and $U_{3|2}$ are assumed to have arbitrary values on the open interval $(0, 1)$ (i.e.,
misclassifying an actually non-malignant case as malignant will have some cost reducing the utility of such
a decision from the maximum, but such a misclassification is in some sense “better” than missing an actual
malignancy). It is important to note that these assumptions are arguably relevant to a reasonable model of a
clinical situation, and are thus of interest beyond their superficial advantage in reducing the degrees of freedom
involved in the observer’s decision rule. We will, however, only consider the latter issue in the remainder of this
section.

Substituting the values of the utilities given above into Eq. 4, we obtain decision boundary lines of the form

\[ 0\,\mathrm{LR}_1 + 0\,\mathrm{LR}_2 = 0 \quad \{\text{“1-vs.-2”}\} \tag{19} \]

\[ \frac{(1 - U_{3|1})\, P(\mathbf{t} = \pi_1)}{a} \mathrm{LR}_1 + \frac{(1 - U_{3|2})\, P(\mathbf{t} = \pi_2)}{a} \mathrm{LR}_2 = \frac{P(\mathbf{t} = \pi_3)}{a} \quad \{\text{“1-vs.-3”}\} \tag{20} \]

\[ \frac{(1 - U_{3|1})\, P(\mathbf{t} = \pi_1)}{a} \mathrm{LR}_1 + \frac{(1 - U_{3|2})\, P(\mathbf{t} = \pi_2)}{a} \mathrm{LR}_2 = \frac{P(\mathbf{t} = \pi_3)}{a} \quad \{\text{“2-vs.-3”}\} \tag{21} \]

where $a \equiv 1 + P(\mathbf{t} = \pi_3) - U_{3|1} P(\mathbf{t} = \pi_1) - U_{3|2} P(\mathbf{t} = \pi_2)$. Note that, as Chan et al. point out, the “1-vs.-2”
line is in fact undefined for this choice of utilities, while the “1-vs.-3” and “2-vs.-3” lines are identical. This is a
general consequence of Eqs. 7-9; if any two of these equations yield identical lines, the third line must be either
identical to them or undefined. The decision rule considered by Chan et al. is illustrated in Fig. 5.
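
The degeneracy described above is easy to reproduce. The following sketch builds the $\gamma_{iji}$ from the restricted utilities, forms the coefficients of Eqs. 7-9, and divides by the normalizing factor $a$. The function name `chan_boundary_lines` and the numerical values chosen for $U_{3|1}$, $U_{3|2}$, and the priors are hypothetical and serve only to illustrate the structure of the result.

```python
from fractions import Fraction as F

def chan_boundary_lines(u31, u32, priors):
    """Coefficients (c1, c2, rhs) of c1*LR1 + c2*LR2 = rhs for Eqs. 7-9,
    under the utility restrictions described above, each divided by a."""
    P = dict(zip((1, 2, 3), priors))
    # U[(i, j)]: utility of deciding class i when the truth is class j.
    U = {(1, 1): 1, (2, 2): 1, (3, 3): 1,   # correct decisions
         (2, 1): 1, (1, 2): 1,              # benign/normal confusion
         (1, 3): 0, (2, 3): 0,              # missed malignancy
         (3, 1): u31, (3, 2): u32}          # non-malignant called malignant

    def g(i, j):
        """gamma_iji = (U_{i|i} - U_{j|i}) P(pi_i)."""
        return (U[i, i] - U[j, i]) * P[i]

    a = sum(g(i, j) for i in (1, 2, 3) for j in (1, 2, 3) if j != i)
    lines = [
        (g(1, 2), -g(2, 1), g(3, 1) - g(3, 2)),   # Eq. 7, "1-vs.-2"
        (g(1, 3), g(2, 3) - g(2, 1), g(3, 1)),    # Eq. 8, "1-vs.-3"
        (g(1, 3) - g(1, 2), g(2, 3), g(3, 2)),    # Eq. 9, "2-vs.-3"
    ]
    return [tuple(c / a for c in line) for line in lines], a

priors = (F(3, 10), F(1, 2), F(1, 5))   # hypothetical P(pi_1), P(pi_2), P(pi_3)
u31, u32 = F(1, 2), F(1, 4)             # hypothetical values in (0, 1)
(l12, l13, l23), a = chan_boundary_lines(u31, u32, priors)

assert l12 == (0, 0, 0)   # the "1-vs.-2" line is undefined
assert l13 == l23         # the other two lines coincide
```

With these values $a = 37/40$, which agrees with the closed-form expression for $a$ given above.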


Figure 6. Decision rule investigated by Mossman, for the decision parameters $\alpha$ and $\beta$, shown in the a posteriori class
probability space.


5.  THE  MOSSMAN  DECISION  RULE 

Mossman investigates a decision rule applied to a set of three decision variables $\mathbf{y}_1$, $\mathbf{y}_2$, and $\mathbf{y}_3$, subject to the
constraint

\[ \mathbf{y}_1 + \mathbf{y}_2 + \mathbf{y}_3 = 1, \tag{22} \]

as well as $0 \le \mathbf{y}_i \le 1$ $\{1 \le i \le 3\}$. This is consistent with the constraint on the a posteriori class probabilities,
$P(\pi_1|\mathbf{x}) + P(\pi_2|\mathbf{x}) + P(\pi_3|\mathbf{x}) = 1$; these quantities are known to be directly related to the likelihood ratio ideal
observer decision variables.18,19 (In this section we will write $P(\pi_i|\mathbf{x})$ instead of $P(\mathbf{t} = \pi_i|\mathbf{x})$ for simplicity.)
Mossman does not explicitly require, however, that the decision variables in Eq. 22 be the a posteriori class
probabilities (e.g., they may be noisy estimates of these quantities).

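The decision rule considered by Mossman, which depends on two decision parameters $\alpha$ and $\beta$, is

\[ \text{decide } \mathbf{d} = \pi_1 \text{ iff } \mathbf{y}_2 - \mathbf{y}_1 < \beta \text{ and } \mathbf{y}_3 < \alpha; \tag{23} \]

\[ \text{decide } \mathbf{d} = \pi_2 \text{ iff } \mathbf{y}_2 - \mathbf{y}_1 \ge \beta \text{ and } \mathbf{y}_3 < \alpha; \tag{24} \]

\[ \text{decide } \mathbf{d} = \pi_3 \text{ iff } \mathbf{y}_3 \ge \alpha, \tag{25} \]

where $0 < \alpha < 1$ and $-1 < \beta < 1$. From these relations, and given the relation $\mathbf{y}_3 = 1 - \mathbf{y}_1 - \mathbf{y}_2$ from Eq. 22,
one can define the decision boundary lines

\[ \mathbf{y}_1 - \mathbf{y}_2 = -\beta \quad \{\text{“1-vs.-2”}\} \tag{26} \]

\[ \mathbf{y}_1 + \mathbf{y}_2 = 1 - \alpha \quad \{\text{“1-vs.-3”}\} \tag{27} \]

\[ \mathbf{y}_1 + \mathbf{y}_2 = 1 - \alpha \quad \{\text{“2-vs.-3”}\}. \tag{28} \]

This decision rule is illustrated in Fig. 6. Note that, similar to the Chan et al. decision rule, the “1-vs.-3” and
“2-vs.-3” decision boundary lines are identical.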
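
Written as code, the Mossman rule reduces to two successive threshold tests, first on $\mathbf{y}_3$ and then on $\mathbf{y}_2 - \mathbf{y}_1$. The sketch below is only an illustration of Eqs. 23-25; the function name `mossman_decide`, the example values, and the assignment of points lying exactly on a threshold are assumptions.

```python
def mossman_decide(y1, y2, y3, alpha, beta):
    """Mossman rule (Eqs. 23-25); y1 + y2 + y3 is assumed to equal 1."""
    if y3 >= alpha:          # first threshold: class 3 versus "not class 3"
        return 3
    # second threshold: separate class 1 from class 2
    return 1 if y2 - y1 < beta else 2

alpha, beta = 0.5, 0.0       # hypothetical decision parameters
decisions = [
    mossman_decide(0.6, 0.3, 0.1, alpha, beta),
    mossman_decide(0.2, 0.5, 0.3, alpha, beta),
    mossman_decide(0.2, 0.2, 0.6, alpha, beta),
]
print(decisions)  # prints "[1, 2, 3]"
```

Because the first test does not depend on $\mathbf{y}_1$ or $\mathbf{y}_2$ separately, the “1-vs.-3” and “2-vs.-3” boundaries necessarily coincide.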

We now consider a special case of the Mossman decision rule in which $\mathbf{y}_1 = P(\pi_1|\mathbf{x})$, $\mathbf{y}_2 = P(\pi_2|\mathbf{x})$, and
$\mathbf{y}_3 = P(\pi_3|\mathbf{x})$ for some observational data vector $\mathbf{x}$. This version of the decision rule is illustrated in Fig. 7.


Figure 7. Decision rule investigated by Mossman, for the decision parameters $\alpha$ and $\beta$, shown in likelihood ratio space.


Although the Mossman decision rule appears similar in form to the ideal observer decision rule, recall from
Sec. 4 that if two of the decision boundary line equations are identical, the third must yield a line identical to
the first two or be undefined. Another way to see this is to note that the coefficients of Eq. 9 are differences of
the corresponding coefficients of Eqs. 7 and 8. If the coefficients of Eqs. 8 and 9 are identical, it must be the case
that the coefficients of Eq. 7 are all zero. For the Mossman decision rule, this would require $1 + \beta = 0$, $1 - \beta = 0$,
and $\beta = 0$ simultaneously, which is clearly impossible. It follows that the decision rule considered by Mossman
cannot represent possible ideal observer performance for any choice of the utilities $U_{i|j}$ in Eqs. 1 and 2.
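
The step from the a posteriori probability space to the likelihood ratio space can be written out explicitly. The following is a sketch of that derivation, assuming $\mathbf{y}_i = P(\pi_i|\mathbf{x})$ and using Bayes' rule; the shorthand $D$ for the common denominator is introduced here only for convenience.

```latex
% Bayes' rule, with LR_3 = 1 by definition (Eq. 3 with N = 3):
P(\pi_i \mid \mathbf{x}) = \frac{P(\pi_i)\,\mathrm{LR}_i}{D}, \qquad
D \equiv P(\pi_1)\,\mathrm{LR}_1 + P(\pi_2)\,\mathrm{LR}_2 + P(\pi_3).

% "1-vs.-2" line (Eq. 26), y_1 - y_2 = -\beta, multiplied through by D:
(1+\beta)\,P(\pi_1)\,\mathrm{LR}_1 - (1-\beta)\,P(\pi_2)\,\mathrm{LR}_2 = -\beta\,P(\pi_3).

% "1-vs.-3" and "2-vs.-3" lines (Eqs. 27 and 28), y_1 + y_2 = 1 - \alpha:
\alpha\,P(\pi_1)\,\mathrm{LR}_1 + \alpha\,P(\pi_2)\,\mathrm{LR}_2 = (1-\alpha)\,P(\pi_3).
```

Since the last two lines coincide, matching the first line to Eq. 7 would force its three coefficients $(1+\beta)P(\pi_1)$, $(1-\beta)P(\pi_2)$, and $\beta P(\pi_3)$ to vanish together, which cannot happen when all three a priori probabilities are nonzero.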

6.  DISCUSSION  AND  CONCLUSIONS 

We  examined  three  decision  rules  proposed  recently  for  three-class  classification  tasks  by  different  researchers. 
The  basis  for  our  evaluation  was  ideal  observer  decision  theory,  primarily  because  our  own  interest  in  the  three- 
class  classification  task  is  its  possible  application  to  CAD. 

Although  this  is  not  the  most  general  approach  to  three-class  classification,  the  three-class  classification  task 
is  difficult  enough  that  it  is  perhaps  worth  making  any  attempt  to  analyze,  from  a  single  point  of  view,  the  work 
of  the  relatively  few  researchers  investigating  this  problem. 

In  particular,  Scurfield  points  out13  that  his  proposed  decision  rule  is  in  fact  an  ideal  observer  decision  rule 
for  a  single  ideal  observer  operating  point,  namely  the  observer  which  maximizes  the  probability  of  any  correct 
response (or “percent correct” or $P_C$). We were able to show that, under various assumptions, a larger set of
such  correspondences  between  the  Scurfield  observer  and  the  ideal  observer  exists. 

Chan et al. are working on the application of three-class classification to CAD, and thus explicitly take
the  ideal  observer  as  the  starting  point  in  the  development  of  their  decision  rule.14  Although  this  rendered  our 
analysis  of  that  decision  rule  in  terms  of  ideal  observer  decision  theory  largely  trivial,  it  provided  an  intuitive 
basis  for  understanding  the  results  of  similar  analysis  of  the  Mossman  decision  rule,  namely  the  conclusion  that 
the  latter  does  not  correspond  to  ideal  observer  behavior  for  any  possible  values  of  the  utilities  used  by  the  ideal 
observer.  However,  we  note  that  the  structure  of  the  Mossman  decision  rule  —  a  simple  sequence  of  thresholds 
on  single  decision  variables  —  may  indeed  serve  as  a  reasonable  model  for  human  observer  performance  in  certain 
situations, e.g., differential diagnosis.

Abstract

We  analyze  recently  proposed  decision  rules  for  three-class  classification  from  the  point  of  view  of  ideal  observer  decision  theory.  We 
consider  three-class  decision  rules  proposed  by  Scurfield,  by  Chan  et  al.,  and  by  Mossman.  Scurfield’s  decision  rule  is  shown  to  be  a 
special  case  of  the  three-class  ideal  observer  decision  rule  in  three  different  situations.  Chan  et  al.  start  with  an  ideal  observer  model  and 
specify  its  decision-consequence  utility  structure  in  a  way  that  causes  two  of  the  decision  lines  used  by  the  ideal  observer  to  overlap  and 
the  third  line  to  become  undefined.  Finally,  we  show  that,  for  a  particular  and  obvious  choice  of  ideal-observer-related  decision  variables, 
the  Mossman  decision  rule  cannot  be  a  special  case  of  the  ideal  observer  decision  rule.  Despite  the  considerable  difficulties  presented  by 
the  three-class  classification  task,  the  three-class  ideal  observer  provides  a  useful  framework  for  analyzing  a  variety  of  three-class  decision 
strategies. 

© 2006 Elsevier Inc. All rights reserved.

1. Introduction

We  are  attempting  to  develop  a  fully  automated  mass 
lesion  classification  scheme  for  computer-aided  diagnosis 
(CAD)  in  mammography.  This  scheme  will  combine  two 
schemes developed at the University of Chicago: one for
automatically detecting mass lesions in mammograms
(Bick et al., 1995; Kupinski, 2000; Yin et al., 1991; Yin,
Giger, Vyborny, Doi, & Schmidt, 1993, 1994), and one for
classifying  known  lesions  as  malignant  or  benign  (Huo, 
Giger,  &  Metz,  1999;  Huo,  Giger,  &  Vyborny,  2001;  Huo, 
Giger,  Vyborny,  &  Metz,  2002;  Huo,  Giger,  Vyborny, 
Wolverton,  &  Metz,  2000;  Huo  et  al.,  1998).  Combining 
these  two  types  of  CAD  scheme  is  inherently  difficult, 
because  the  output  of  the  detection  scheme  will  necessarily 
include  false-positive  (FP)  computer  detections  in  addition 
to the malignant and benign lesions to be classified. These
FP computer detections correspond to objects which
were  by  design  not  included  in  the  training  sample 
of  the  classification  scheme,  because  they  are  not  members 
of  the  data  population  (benign  and  malignant  mass 
breast  lesions)  for  which  the  classification  scheme  was 
created.  It  is  clear  then  that  the  detection  scheme’s  output 
cannot  be  used  unmodified  as  the  input  to  the  classification 
scheme. 

Our  approach  has  been  to  treat  this  problem  explicitly  as 
a  three-class  classification  task.  That  is,  the  outputs  of  the 
detection  scheme  should  be  classified  as  malignant  lesions, 
benign  lesions,  and  non-lesions  (FP  computer  detections), 
and  the  classifier  to  be  estimated  is  the  ideal  observer 
decision  rule  for  this  task.  Such  an  approach  presents 
considerable  difficulties  of  its  own.  On  the  one  hand, 
decision  rules,  in  particular  ideal  observer  decision  rules, 
increase  rapidly  in  complexity  with  the  number  of  classes 
involved.  On  the  other  hand,  a  fully  general  performance 
evaluation  method,  such  as  a  three-class  extension  of 
receiver  operating  characteristic  (ROC)  analysis,  has  yet  to 
be  developed.  It  should  be  mentioned  that  the  simple  model 
we  have  just  described  corresponds  in  the  two-class 
classification task to ROC analysis performed “per
detection;” that is, each “case” being classified corresponds
to  a  small  region  of  interest  (ROI)  in  the  image  containing 
a  single  computer  detection.  Other  formulations,  such  as 
ROC  analysis  “per  image,”  ROC  analysis  “per  patient” 
(for  a  set  of  images,  such  as  the  four  mammographic  views 
obtained  in  a  typical  screening  setting),  or  free-response 
ROC (FROC) (Bunch, Hamilton, Sanderson, & Simmons,
1978;  Chakraborty,  1989,  2002)  analysis,  are  also  possible, 
but  their  extension  to  tasks  with  three  or  more  classes  is 
beyond  the  scope  of  the  present  work. 

The  explicit  form  of  the  decision  rule  used  by  the  ideal 
observer  in  a  three-class  classification  task  has  been  known 
for  some  time  (Van  Trees,  1968).  For  the  reasons  just 
stated,  however,  a  practical  and  general  method  for 
estimating  and  evaluating  observer  performance  has 
proven  elusive.  In  particular,  Scurfield  (1996)  defined  the 
two-class  information-based  performance  metric  D\  2  = 
log 2 - AUC log AUC - (1 - AUC) log(1 - AUC) (where
AUC  is  the  area  under  the  two-class  ROC  curve),  and 
extended  it  to  the  three-class  case  for  two  different  decision 
rules  (Scurfield,  1996,  1998).  Srinivasan  (1999)  investigated 
the  optimality  of  discrete,  multi-class  ROC  operating 
points,  but  not  continuous  ROC  hypersurfaces,  under  a 
cost  function  equivalent  to  the  Bayes  risk.  Mossman  (1999) 
evaluated  the  performance  of  a  three-class  classifier  with  a 
surface  formed  from  the  three  correct  classification 
probabilities. Hand and Till (2001) proposed the average
of the areas under all N(N - 1)/2 between-class ROC
curves as a performance metric in an N-class classification
task.  Obuchowski  et  al.  (2001)  elicited  readers’  estimates  of 
the  set  of  probabilities  of  each  observation  belonging  to  N 
classes,  and  then  used  conventional  (two-class)  ROC 
analysis to evaluate each of the N(N - 1)/2 differences of
these  estimates  for  its  ability  to  distinguish  between  the 
relevant  pair  of  classes.  Ferri,  Hernandez-Orallo,  and 
Salido  (2003)  proposed  a  variety  of  algorithms  for 
calculating  the  hypervolume  under  the  convex  hull 
obtained  from  a  set  of  discrete  ROC  operating  points;  a 
modified version of the Hand and Till metric averaging the
N areas under the ROC surfaces that measure the
observer’s ability to distinguish a given class from the
remaining N - 1; and a graphical “cobweb” representation
of  the  observer’s  misclassification  probabilities.  Lachiche 
and  Flach  (2003)  proposed  iterative  algorithms  for  finding 
the  optimal  among  a  discrete  set  of  multi-class  ROC 
operating  points  based  on  either  percent  correct  or  Bayes 
risk.  Nakas  and  Yiannoutsos  (2004)  considered  an 
observer  using  a  decision  rule  similar  to  that  of  Scurfield 
(1996),  and  evaluated  its  performance  statistically  by 
extending  methods  proposed  by  Dreiseitl,  Ohno-Machado, 
and  Binder  (2000).  Patel  and  Markey  (2005)  applied  a 
variety  of  proposed  evaluation  metrics,  including  the  Hand 
and  Till  metric,  the  modified  Hand  and  Till  metric  of 
Ferri,  the  “cobweb”  graphical  measure  of  Ferri,  and  the 
Mossman  ROC  surface,  to  radiologist  assessment  data  of 
mammographic  images  from  patients  who  subsequently 
underwent  biopsy. 
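
To make one of the metrics above concrete, the following sketch computes an average of pairwise areas under the ROC curve in the spirit of the Hand and Till proposal. It is not taken from any of the works cited: the function names, the Mann-Whitney estimate of each area, the convention of averaging the two directed estimates for each pair of classes, and the toy data are all assumptions of this illustration.

```python
import numpy as np

def auc(pos, neg):
    """Mann-Whitney estimate of P(pos score > neg score), ties counted as 1/2."""
    pos = np.asarray(pos, dtype=float)[:, None]
    neg = np.asarray(neg, dtype=float)[None, :]
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))

def pairwise_auc_average(probs, labels):
    """Average of the N(N - 1)/2 between-class AUC estimates.

    probs[n, c] is an estimated probability that case n belongs to class c;
    labels[n] is the true class (0-based) of case n.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    n_classes = probs.shape[1]
    total, pairs = 0.0, 0
    for i in range(n_classes):
        for j in range(i + 1, n_classes):
            # Separate classes i and j using the class-i estimate, then the class-j estimate.
            a_ij = auc(probs[labels == i, i], probs[labels == j, i])
            a_ji = auc(probs[labels == j, j], probs[labels == i, j])
            total += 0.5 * (a_ij + a_ji)
            pairs += 1
    return total / pairs

labels = np.array([0, 0, 1, 1, 2, 2])
perfect = np.eye(3)[labels]              # each case assigned to its true class
uninformative = np.full((6, 3), 1 / 3)   # identical estimates for every case

m_perfect = pairwise_auc_average(perfect, labels)        # 1.0
m_chance = pairwise_auc_average(uninformative, labels)   # 0.5
```

A summary of this kind collapses three-class performance into a single number, so two observers with quite different decision strategies can receive the same value.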


The works cited above demonstrate the difficulty in
developing a fully general performance metric for classification
tasks with more than two classes. Lacking such a
performance metric in turn makes the development of
observer decision rules for such tasks difficult, because they
can at present be evaluated and compared only from a
theoretical rather than an empirical perspective. Nevertheless,
observer decision rule models for three-class
classification tasks have been proposed relatively recently
by several groups of researchers. In some cases, these
models are motivated more by considerations of tractability
than of complete generality. This is of course
understandable  given  the  inherent  difficulties  of  three-class 
classification;  however,  we  thought  it  might  be  of  interest 
to  analyze  a  number  of  recently  proposed  three-class 
decision  rule  models  within  an  ideal  observer  decision  rule 
framework. 

In  the  next  section,  we  review  the  three-class  ideal 
observer  decision  rule.  In  the  following  three  sections,  we 
review  recently  proposed  three-class  decision  rule  models: 
one  by  Scurfield  (1998),  one  by  Chan,  Sahiner,  Hadjiiski, 
Petrick,  and  Zhou  (2003),  and  one  by  Mossman  (1999).  In 
each  case,  the  given  decision  rule  is  analyzed  in  terms  of  the 
ideal  observer  decision  rule;  where  necessary  or  expedient, 
assumptions  are  made  about  the  observer’s  decision 
variables  in  order  to  facilitate  this  analysis.  We  emphasize 
that  we  do  not  attempt  a  review  of  the  experimental 
methods  or  detailed  analysis  of  proposed  performance 
evaluation  metrics  in  the  works  discussed;  we  are  here 
interested  only  in  the  form  of  the  decision  rule  which  serves 
as  the  starting  point  for  each  work,  and  superficially  in  the 
proposed  evaluation  metrics  inasmuch  as  they  are  related 
to  those  decision  rules.  (Because  of  the  lack  of  a  fully 
general  performance  metric,  or  figure  of  merit,  for  the 
three-class  classification  task  and,  in  particular,  apparent 
inconsistencies  which  are  obtained  from  a  straightforward 
generalization  of  the  area  under  the  ROC  curve  (Edwards, 
Metz,  &  Nishikawa,  2005)  we  do  not  attempt  any 
validation  or  quantitative  comparison  of  the  proposed 
performance  metrics.)  The  results  of  our  analyses  are 
briefly  summarized  in  Section  6. 

2.  The  three-class  ideal  observer 

It can be shown (Edwards, Metz, & Kupinski, 2004b;
Van Trees, 1968) that an $N$-class ideal observer makes
decisions regarding statistically variable observations $\mathbf{x}$ by
partitioning a likelihood ratio decision variable space,
where the boundaries of the partitions are given by
hyperplanes

\[ \text{decide } \mathbf{d} = \pi_i \text{ iff } \sum_{k=1}^{N-1} (U_{i|k} - U_{j|k})\, P(\mathbf{t} = \pi_k)\, \mathrm{LR}_k \ge (U_{j|N} - U_{i|N})\, P(\mathbf{t} = \pi_N) \quad \{j < i\} \tag{1} \]

and

\[ \sum_{k=1}^{N-1} (U_{i|k} - U_{j|k})\, P(\mathbf{t} = \pi_k)\, \mathrm{LR}_k > (U_{j|N} - U_{i|N})\, P(\mathbf{t} = \pi_N) \quad \{j > i\}. \tag{2} \]

Here $U_{i|j}$ is the utility of deciding an observation is from
class $\pi_i$ given that it is actually from class $\pi_j$, and the $N - 1$
likelihood ratios are defined as

\[ \mathrm{LR}_k \equiv \frac{p_{\mathbf{x}}(\mathbf{x} \mid \mathbf{t} = \pi_k)}{p_{\mathbf{x}}(\mathbf{x} \mid \mathbf{t} = \pi_N)} \tag{3} \]

for $k < N$. We also define the actual class (the “truth”) to
which an observation belongs as $\mathbf{t}$, and the class to which it
is assigned (the “decision”) as $\mathbf{d}$, where $\mathbf{t}$ and $\mathbf{d}$ can take on
any of the values $\pi_1, \ldots, \pi_i, \ldots, \pi_N$, the labels of the various
classes. (We use boldface type to denote statistically
variable quantities.) For simplicity, we will usually write
$\pi_k$ to denote the event $\mathbf{t} = \pi_k$, as in the a priori probability
$P(\pi_k)$.

The partitioning of the decision variable space is
determined by the parameters

\[ \gamma_{ijk} \equiv (U_{i|k} - U_{j|k})\, P(\pi_k), \tag{4} \]

with $i$, $j$, and $k$ varying from 1 to $N$, and $j \ne i$. Note that
these parameters are not independent, however, because

\[ \gamma_{ijk} = \gamma_{kjk} - \gamma_{kik}. \tag{5} \]

We can impose the reasonable condition that the utility
for correctly classifying an observation from a given class
should be greater than any utility for incorrectly classifying
an observation from the same class, i.e., $U_{i|i} > U_{j|i}$ $\{j \ne i\}$.
This gives, for $j \ne i$,

\[ \gamma_{iji} > 0, \tag{6} \]

leaving $N(N - 1)$ parameters (the rest are derivable
from (5)).

Finally, note that the hyperplanes represented by (1) and
(2) are unchanged if we multiply all of these relations by a
single scalar, such as $1 / \sum_{i} \sum_{j \ne i} \gamma_{iji}$. This leaves us with
$N^2 - N - 1$ degrees of freedom, as expected, and effectively
imposes the condition

\[ \sum_{i=1}^{N} \sum_{j \ne i} \gamma_{iji} = 1. \tag{7} \]


Fig.  1 .  Example  three-class  ideal  observer  decision  rule,  given  the  values 
of  the  decision  parameters  y121  =  y212  =  n  and 
7131  =  7313  =  7232  =  7323  =  ?•  Note  that  7 iji  =  W i\i  ~  Uj\,)P( t  =  Jt/). 

decision  boundary  line  equations  as  the  difference  in 
utilities  between  a  “correct”  and  one  particular  “incorrect” 
decision  (scaled  by  the  a  priori  probability  of  the  true  class 
in  question);  and  of  each  difference  in  the  parameters  as 
a  difference  in  utilities  between  two  possible  “incorrect” 
decisions  (again  scaled  by  the  a  priori  probability  of  the 
true  class  in  question). 

An  example  ideal  observer  decision  rule  for  particular 
values  of  the  utilities  U^,  and  hence  of  the  parameters 
is  shown  in  Fig.  1.  Here  we  have  chosen  ym  =  y2 \2  = 
and  y13l  =  y313  =  y232  =  y323  =  ',  yielding  the  decision 
boundary  lines 

^LR,-^LR2  =  0  {  “  l-vs.-2”  }  ,  (11) 

ALRi-,LLR2  =  i  {“1-IW.-3”},  (12) 


The  behavior  of  a  three-class  ideal  observer  is  completely 
determined  by  the  three  decision  boundary  lines 


—  j^LRi  +  yLR2  =  1  {  “  2-vs.-3”  }  .  (13) 

These  simplify  to  the  equations  LR2  =  LR], 
LR2  =  2LRi  —  2,  and  LR2  =  LR//2  +  1,  respectively. 


7121  LRi  —  7212LR2  =  7313  —  7323?  (8) 

7l3lLRl  +  (7232  ~  7212)LR-2  =  7313’  (9) 

(7i3i  —  7i2i)LRi  +  y232LR2  =  y323,  (10) 

which  we  call,  respectively,  the  “l-us.-2”  line,  the  “l-ra.-3” 
line,  and  the  “2-vs.-3”  line.  Note  that  if  any  two  of  these 
lines  intersect,  the  third  line  must  also  share  this  intersec¬ 
tion  point.  We  also  emphasize  the  simple  interpretation, 
from  (4),  of  each  of  the  yi2i  parameters  appearing  in  these 
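The statement that the three lines (8)-(10) share a common intersection point can be checked numerically. The following Python sketch is illustrative only: the function names are hypothetical, and the parameter values 3/14 and 1/7 are an assumption chosen so that the lines reduce to LR2 = LR1, LR2 = 2LR1 - 2, and LR2 = LR1/2 + 1 while the six parameters sum to one.

```python
from fractions import Fraction as F

def boundary_lines(g121, g212, g131, g232, g313, g323):
    """Coefficients (a, b, c) of a*LR1 + b*LR2 = c for lines (8)-(10)."""
    return {
        "1-vs-2": (g121, -g212, g313 - g323),
        "1-vs-3": (g131, g232 - g212, g313),
        "2-vs-3": (g131 - g121, g232, g323),
    }

def intersection(line_a, line_b):
    """Intersect two lines by Cramer's rule; None if parallel or undefined."""
    a1, b1, c1 = line_a
    a2, b2, c2 = line_b
    det = a1 * b2 - a2 * b1
    if det == 0:
        return None
    return ((c1 * b2 - c2 * b1) / det, (a1 * c2 - a2 * c1) / det)

# Assumed example values (exact rational arithmetic avoids rounding issues).
params = (F(3, 14), F(3, 14), F(1, 7), F(1, 7), F(1, 7), F(1, 7))
lines = boundary_lines(*params)

# Intersect the first two lines, then test whether the third passes through.
point = intersection(lines["1-vs-2"], lines["1-vs-3"])
a, b, c = lines["2-vs-3"]
on_third_line = a * point[0] + b * point[1] == c
```

Because (10) is the difference of (9) and (8), any point satisfying two of the equations automatically satisfies the third; the sketch simply makes that visible for one set of values.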


3. The Scurfield decision rule

Scurfield investigated a decision rule applied to two-dimensional statistically variable data ($\mathbf{y} = (y_1, y_2)$) drawn from three classes (Scurfield, 1998). The application domain was human observer performance modeling for acoustical psychophysics experiments. (In prior work, Scurfield investigated a decision rule for three-class classification of univariate data (Scurfield, 1996). We will not review that prior work here, because at present we are interested in relating given observer models to the general three-class ideal observer model for multivariate observational data, which, except in degenerate cases, will yield two-dimensional decision variable data by (3).) In Scurfield’s work, no assumptions are made about the decision variables $y_1$ and $y_2$; in particular, these decision variables are not assumed to be related in any way to an ideal observer model. This is entirely appropriate given the nature of the problem domain Scurfield investigated, i.e., human observer performance modeling. It can readily be shown, however, that if one chooses to make such assumptions, special cases of the Scurfield model are in fact special cases of an ideal observer decision rule.

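The Scurfield decision rule is dependent on two decision parameters, which we will call $\gamma_1$ and $\gamma_2$. The decision rule can be written as

$$\text{decide } \mathbf{d} = \pi_1 \text{ iff } y_1 - y_2 \ge \gamma_1 - \gamma_2 \text{ and } y_1 \ge \gamma_1, \tag{14}$$

$$\text{decide } \mathbf{d} = \pi_2 \text{ iff } y_1 - y_2 < \gamma_1 - \gamma_2 \text{ and } y_2 \ge \gamma_2, \tag{15}$$

$$\text{decide } \mathbf{d} = \pi_3 \text{ iff } y_1 < \gamma_1 \text{ and } y_2 < \gamma_2. \tag{16}$$

This decision rule is illustrated in Fig. 2.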
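The rule in (14)-(16) can be sketched as a small Python function. This is a minimal illustration rather than Scurfield's own implementation; the function name is hypothetical, and the non-strict inequalities for classes 1 and 2 are an assumption about how ties on the boundaries are resolved.

```python
def scurfield_decide(y1, y2, gamma1, gamma2):
    """Return 1, 2, or 3 for the class chosen by the rule in (14)-(16)."""
    if y1 - y2 >= gamma1 - gamma2 and y1 >= gamma1:
        return 1
    if y1 - y2 < gamma1 - gamma2 and y2 >= gamma2:
        return 2
    # Everything left over satisfies y1 < gamma1 and y2 < gamma2.
    return 3
```

The final branch needs no explicit test: if the first condition fails with y1 - y2 >= gamma1 - gamma2, then y1 < gamma1 and hence y2 < gamma2, and symmetrically for the second condition, so the three regions partition the plane.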

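Fig. 2. Decision rule investigated by Scurfield, for the decision parameters $\gamma_1$ and $\gamma_2$.

From these relations, one can define the decision boundary lines

$$y_1 - y_2 = \gamma_1 - \gamma_2 \quad \{\text{“1-vs.-2”}\}, \tag{17}$$

$$y_1 = \gamma_1 \quad \{\text{“1-vs.-3”}\}, \tag{18}$$

$$y_2 = \gamma_2 \quad \{\text{“2-vs.-3”}\}. \tag{19}$$

If we choose $y_1 = \mathrm{LR}_1(x)$ and $y_2 = \mathrm{LR}_2(x)$ for some set of observational data $x$, we have

$$\frac{1}{\gamma_0} \mathrm{LR}_1 - \frac{1}{\gamma_0} \mathrm{LR}_2 = \frac{\gamma_1 - \gamma_2}{\gamma_0} \quad \{\text{“1-vs.-2”}\}, \tag{20}$$

$$\frac{1}{\gamma_0} \mathrm{LR}_1 = \frac{\gamma_1}{\gamma_0} \quad \{\text{“1-vs.-3”}\}, \tag{21}$$

$$\frac{1}{\gamma_0} \mathrm{LR}_2 = \frac{\gamma_2}{\gamma_0} \quad \{\text{“2-vs.-3”}\}, \tag{22}$$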

where $\gamma_0 \equiv \gamma_1 + \gamma_2 + 4$ (to impose consistency with (7)). Note the similarity in form between these equations and (8)-(10). If we require $\gamma_1$ and $\gamma_2$ to be positive, the correspondence is exact, and this special case of (8)-(10) is illustrated in Fig. 3. (In fact, the intersection of the ideal observer decision boundary lines can lie in any quadrant. However, given a set of decision boundary lines with slopes as depicted in Fig. 2, the occurrence of the intersection point in any quadrant other than the first would result in an ideal observer operating point for which no observations were assigned to class $\pi_3$. This “degenerate” case will not be considered here.) As an aside, it is of some interest to note that if $\gamma_1 = \gamma_2 = 1$, the decision boundary line equations reduce to $\mathrm{LR}_1 = \mathrm{LR}_2$, yielding $p(x|\pi_1) = p(x|\pi_2)$; $\mathrm{LR}_1 = 1$, yielding $p(x|\pi_1) = p(x|\pi_3)$; and $\mathrm{LR}_2 = 1$, yielding $p(x|\pi_2) = p(x|\pi_3)$. That is, the decision boundary lines correspond, in the observational data space, to the loci of intersection of the observational data probability density functions. (This is illustrated in Figs. 2B and 2C of Scurfield (1998).)

Fig. 3. A special case of the ideal observer decision rule with $\gamma_{121} = \gamma_{212} = \gamma_{131} = \gamma_{232} = 1/(\gamma_1 + \gamma_2 + 4)$, $\gamma_{313} = \gamma_1/(\gamma_1 + \gamma_2 + 4)$, and $\gamma_{323} = \gamma_2/(\gamma_1 + \gamma_2 + 4)$. The parameters $\gamma_1$ and $\gamma_2$ are positive but otherwise arbitrary; this decision rule is a special case of the Scurfield decision rule with $y_1 = \mathrm{LR}_1(x)$ and $y_2 = \mathrm{LR}_2(x)$.

A second correspondence between Scurfield’s decision rule and the ideal observer decision rule can be obtained by taking $y_1 = \log(\mathrm{LR}_1(x))$ and $y_2 = \log(\mathrm{LR}_2(x))$, with $\gamma_1$ and $\gamma_2$ now unrestricted. Substituting this definition in (17)-(19), we obtain

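$$\log(\mathrm{LR}_1) - \log(\mathrm{LR}_2) = \gamma_1 - \gamma_2 \quad \{\text{“1-vs.-2”}\}, \tag{23}$$

$$\log(\mathrm{LR}_1) = \gamma_1 \quad \{\text{“1-vs.-3”}\}, \tag{24}$$

$$\log(\mathrm{LR}_2) = \gamma_2 \quad \{\text{“2-vs.-3”}\}. \tag{25}$$

Taking exponentials on each side of these equations then gives

$$\frac{\mathrm{LR}_1}{\mathrm{LR}_2} = e^{\gamma_1 - \gamma_2} \quad \{\text{“1-vs.-2”}\}, \tag{26}$$

$$\mathrm{LR}_1 = e^{\gamma_1} \quad \{\text{“1-vs.-3”}\}, \tag{27}$$

$$\mathrm{LR}_2 = e^{\gamma_2} \quad \{\text{“2-vs.-3”}\}; \tag{28}$$

we can then rearrange terms and divide the equations by a constant factor to obtain

$$\frac{e^{-\gamma_1}}{\gamma_0} \mathrm{LR}_1 - \frac{e^{-\gamma_2}}{\gamma_0} \mathrm{LR}_2 = 0 \quad \{\text{“1-vs.-2”}\}, \tag{29}$$

$$\frac{e^{-\gamma_1}}{\gamma_0} \mathrm{LR}_1 = \frac{1}{\gamma_0} \quad \{\text{“1-vs.-3”}\}, \tag{30}$$

$$\frac{e^{-\gamma_2}}{\gamma_0} \mathrm{LR}_2 = \frac{1}{\gamma_0} \quad \{\text{“2-vs.-3”}\}, \tag{31}$$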

where $\gamma_0 \equiv 2(e^{-\gamma_1} + e^{-\gamma_2} + 1)$. By inspection, this is again a special case of (8)-(10), which is illustrated in Fig. 4. (This special case is currently the subject of independent analysis by He, Metz, Tsui, Links, & Frey, 2006.) As an aside, we note that if $\gamma_1 = \gamma_2 = 0$, the resulting decision boundary lines again correspond, in the observational data space, to the loci of intersection of the observational data probability density functions, as was pointed out in the text following (20)-(22).
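The correspondence for the log-likelihood-ratio case can be checked numerically. The sketch below is a hedged illustration: the function name and the threshold values 0.7 and -0.3 are hypothetical, and the mapping from Scurfield's thresholds to the ideal observer parameters is the one implied by (29)-(31) with $\gamma_0 = 2(e^{-\gamma_1} + e^{-\gamma_2} + 1)$.

```python
import math

def gammas_from_log_lr_thresholds(gamma1, gamma2):
    """Ideal observer parameters implied by Scurfield thresholds on log(LR)."""
    e1, e2 = math.exp(-gamma1), math.exp(-gamma2)
    g0 = 2.0 * (e1 + e2 + 1.0)  # normalizing constant
    return {
        "g121": e1 / g0, "g131": e1 / g0,
        "g212": e2 / g0, "g232": e2 / g0,
        "g313": 1.0 / g0, "g323": 1.0 / g0,
    }

g = gammas_from_log_lr_thresholds(0.7, -0.3)

# Normalization condition (7): the six parameters should sum to one.
total = sum(g.values())

# In (9) the LR2 coefficient g232 - g212 vanishes, so the "1-vs.-3" line
# reduces to a threshold on LR1 alone, which should equal exp(gamma1).
lr1_threshold = g["g313"] / g["g131"]
```

All six parameters are positive for any real thresholds, which is why this case places no restriction on $\gamma_1$ and $\gamma_2$.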

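Fig. 4. A special case of the ideal observer decision rule with $\gamma_{121} = \gamma_{131} = e^{-\gamma_1}/\gamma_0$, $\gamma_{212} = \gamma_{232} = e^{-\gamma_2}/\gamma_0$, $\gamma_{313} = \gamma_{323} = 1/\gamma_0$, and $\gamma_0 \equiv 2(e^{-\gamma_1} + e^{-\gamma_2} + 1)$. The parameters $\gamma_1$ and $\gamma_2$ are arbitrary; this decision rule is a special case of the Scurfield decision rule with $y_1 = \log(\mathrm{LR}_1(x))$ and $y_2 = \log(\mathrm{LR}_2(x))$.

Finally, if we take $y_1 = P(\pi_1|x)$ and $y_2 = P(\pi_2|x)$, and require $0 < \gamma_1 < 1$ and $0 < \gamma_2 < 1$, we obtain

$$P(\pi_1|x) - P(\pi_2|x) = \gamma_1 - \gamma_2 \quad \{\text{“1-vs.-2”}\}, \tag{32}$$

$$P(\pi_1|x) = \gamma_1 \quad \{\text{“1-vs.-3”}\}, \tag{33}$$

$$P(\pi_2|x) = \gamma_2 \quad \{\text{“2-vs.-3”}\}, \tag{34}$$

as illustrated in Fig. 5.

Fig. 5. A special case of the Scurfield decision rule with $y_1 = P(\pi_1|x)$ and $y_2 = P(\pi_2|x)$.

Note that (3) can be written as

$$\mathrm{LR}_i = \frac{P(\pi_i|x)\, p(x) / P(\pi_i)}{p(x|\pi_3)} \quad \{i : 1 \le i \le 2\},$$

so that

$$P(\pi_i|x) = \frac{\mathrm{LR}_i P(\pi_i)}{p(x)/p(x|\pi_3)} = \frac{\mathrm{LR}_i [P(\pi_i)/P(\pi_3)]}{1 + \mathrm{LR}_1 [P(\pi_1)/P(\pi_3)] + \mathrm{LR}_2 [P(\pi_2)/P(\pi_3)]}. \tag{35}$$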
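This allows us to rewrite (32)-(34) as

$$\frac{1 - (\gamma_1 - \gamma_2)}{\gamma_0} \frac{P(\pi_1)}{P(\pi_3)} \mathrm{LR}_1 - \frac{1 + (\gamma_1 - \gamma_2)}{\gamma_0} \frac{P(\pi_2)}{P(\pi_3)} \mathrm{LR}_2 = \frac{\gamma_1 - \gamma_2}{\gamma_0}, \tag{36}$$

$$\frac{1 - \gamma_1}{\gamma_0} \frac{P(\pi_1)}{P(\pi_3)} \mathrm{LR}_1 - \frac{\gamma_1}{\gamma_0} \frac{P(\pi_2)}{P(\pi_3)} \mathrm{LR}_2 = \frac{\gamma_1}{\gamma_0}, \tag{37}$$

$$-\frac{\gamma_2}{\gamma_0} \frac{P(\pi_1)}{P(\pi_3)} \mathrm{LR}_1 + \frac{1 - \gamma_2}{\gamma_0} \frac{P(\pi_2)}{P(\pi_3)} \mathrm{LR}_2 = \frac{\gamma_2}{\gamma_0}, \tag{38}$$

respectively, where $\gamma_0 \equiv (2 - 2\gamma_1 + \gamma_2) P(\pi_1)/P(\pi_3) + (2 + \gamma_1 - 2\gamma_2) P(\pi_2)/P(\pi_3) + \gamma_1 + \gamma_2$. This is again a special case of (8)-(10), as the quantities $1 - (\gamma_1 - \gamma_2)$, $1 + (\gamma_1 - \gamma_2)$, $1 - \gamma_1$, and $1 - \gamma_2$ are all positive given $0 < \gamma_1 < 1$ and $0 < \gamma_2 < 1$.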
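The step from a threshold on an a posteriori probability to a line in likelihood ratio space can be verified with exact arithmetic. In the sketch below the function names, prior probabilities, and likelihood ratio values are all hypothetical; the “1-vs.-3” line is taken in the form $(1 - \gamma_1)[P(\pi_1)/P(\pi_3)]\mathrm{LR}_1 - \gamma_1 [P(\pi_2)/P(\pi_3)]\mathrm{LR}_2 = \gamma_1$, i.e. (37) with the common factor $1/\gamma_0$ dropped.

```python
from fractions import Fraction as F

def posteriors(lr1, lr2, p1, p2, p3):
    """A posteriori class probabilities from likelihood ratios, following (35)."""
    a1 = lr1 * p1 / p3
    a2 = lr2 * p2 / p3
    denom = 1 + a1 + a2
    return a1 / denom, a2 / denom, 1 / denom

def line_37_residual(lr1, lr2, gamma1, p1, p2, p3):
    """Left side minus right side of (37), without the factor 1/gamma_0."""
    return (1 - gamma1) * (p1 / p3) * lr1 - gamma1 * (p2 / p3) * lr2 - gamma1

# Hypothetical priors and one point in likelihood ratio space.
p1, p2, p3 = F(3, 10), F(1, 2), F(1, 5)
lr1, lr2 = F(2), F(1, 2)
post1, post2, post3 = posteriors(lr1, lr2, p1, p2, p3)

# Choose gamma1 so that this point lies exactly on P(pi_1 | x) = gamma1;
# the point should then also lie on the rewritten line in LR space.
residual = line_37_residual(lr1, lr2, post1, p1, p2, p3)
```

A zero residual confirms that the two descriptions of the boundary agree at the chosen point.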

Scurfield (1998) points out that the observer which maximizes $P_C$, the “percent correct” or probability of a correct response, is a special case of the ideal observer (i.e., a single operating point achievable by the ideal observer for the given task). This observer follows the Scurfield decision rule model with $y_1 = \log(\mathrm{LR}_1(x))$ and $y_2 = \log(\mathrm{LR}_2(x))$, and decision parameters given by $e^{\gamma_1} = P(\pi_3)/P(\pi_1)$ and $e^{\gamma_2} = P(\pi_3)/P(\pi_2)$. It is interesting to note that the Scurfield decision rule model can in fact be used to describe ideal observer performance for an even wider class of operating points, as shown in this section.
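The percent-correct observer can be compared directly against a maximum a posteriori classifier. The following sketch is an illustration under stated assumptions: the function names, the prior probabilities, and the grid of likelihood ratio values are hypothetical, the thresholds are taken as $\gamma_1 = \log[P(\pi_3)/P(\pi_1)]$ and $\gamma_2 = \log[P(\pi_3)/P(\pi_2)]$, and the grid is chosen to avoid points lying exactly on a decision boundary.

```python
import math

def max_posterior_class(lr1, lr2, p1, p2, p3):
    """Class with the largest a posteriori probability (scores share a common factor)."""
    scores = (lr1 * p1, lr2 * p2, p3)
    return 1 + max(range(3), key=lambda i: scores[i])

def scurfield_log_lr_class(lr1, lr2, gamma1, gamma2):
    """Scurfield-type rule applied to y1 = log(LR1), y2 = log(LR2)."""
    y1, y2 = math.log(lr1), math.log(lr2)
    if y1 - y2 >= gamma1 - gamma2 and y1 >= gamma1:
        return 1
    if y1 - y2 < gamma1 - gamma2 and y2 >= gamma2:
        return 2
    return 3

p1, p2, p3 = 0.3, 0.5, 0.2          # hypothetical prior probabilities
gamma1 = math.log(p3 / p1)
gamma2 = math.log(p3 / p2)

grid = [0.1, 0.5, 1.0, 2.0, 5.0]    # hypothetical likelihood ratio values
agree = all(
    max_posterior_class(a, b, p1, p2, p3)
    == scurfield_log_lr_class(a, b, gamma1, gamma2)
    for a in grid for b in grid
)
```

Agreement on the grid is a sanity check, not a proof; the equivalence follows algebraically because each inequality in the Scurfield rule becomes a pairwise comparison of $P(\pi_i)\,p(x|\pi_i)$ terms.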

To evaluate the performance of an observer using the decision rule in (17)-(19), Scurfield plots a set of six surfaces in three-dimensional ROC spaces, giving $P(\mathbf{d} = \pi_2 | \mathbf{t} = \alpha(\pi_2))$ as a function of $P(\mathbf{d} = \pi_1 | \mathbf{t} = \alpha(\pi_1))$ and $P(\mathbf{d} = \pi_3 | \mathbf{t} = \alpha(\pi_3))$. Here $\alpha$ is one of the six possible permutations of three symbols. Scurfield gives a probabilistic interpretation for this evaluation methodology: the volume under each surface is the probability of a particular outcome in a three-alternative forced choice experiment, and thus the six volumes must sum to one. This constraint means that at most five of the surfaces are independent. However, given the number of conditional probabilities $P(\mathbf{d} = \pi_i | \mathbf{t} = \pi_j)$ involved, one can show that only four such surfaces are required to completely specify the tradeoffs among the observer’s conditional classification probabilities. Without loss of generality, we consider plotting each of $P(\mathbf{d} = \pi_2 | \mathbf{t} = \pi_1)$, $P(\mathbf{d} = \pi_2 | \mathbf{t} = \pi_3)$, $P(\mathbf{d} = \pi_3 | \mathbf{t} = \pi_1)$, and $P(\mathbf{d} = \pi_3 | \mathbf{t} = \pi_2)$ as functions of $P(\mathbf{d} = \pi_1 | \mathbf{t} = \pi_2)$ and $P(\mathbf{d} = \pi_1 | \mathbf{t} = \pi_3)$. (As with Scurfield’s plots, these are well defined because Scurfield’s decision rule has two degrees of freedom, namely the parameters $\gamma_1$ and $\gamma_2$.)

Now consider one of Scurfield’s plots, for example that which gives $P(\mathbf{d} = \pi_2 | \mathbf{t} = \pi_2)$ as a function of $P(\mathbf{d} = \pi_1 | \mathbf{t} = \pi_1)$ and $P(\mathbf{d} = \pi_3 | \mathbf{t} = \pi_3)$. Because these are conditional probabilities, we have

$$P(\mathbf{d} = \pi_1 | \mathbf{t} = \pi_1) = 1 - P(\mathbf{d} = \pi_2 | \mathbf{t} = \pi_1) - P(\mathbf{d} = \pi_3 | \mathbf{t} = \pi_1), \tag{39}$$

$$P(\mathbf{d} = \pi_2 | \mathbf{t} = \pi_2) = 1 - P(\mathbf{d} = \pi_1 | \mathbf{t} = \pi_2) - P(\mathbf{d} = \pi_3 | \mathbf{t} = \pi_2), \tag{40}$$

$$P(\mathbf{d} = \pi_3 | \mathbf{t} = \pi_3) = 1 - P(\mathbf{d} = \pi_1 | \mathbf{t} = \pi_3) - P(\mathbf{d} = \pi_2 | \mathbf{t} = \pi_3). \tag{41}$$

Each of the conditional probabilities on the right-hand side of these equations can be written as functions of $P(\mathbf{d} = \pi_1 | \mathbf{t} = \pi_2)$ and $P(\mathbf{d} = \pi_1 | \mathbf{t} = \pi_3)$ in our formulation; thus, the surface given in this plot is determined parametrically by the set of four surfaces we have given. Similar remarks hold for the other five surfaces used by Scurfield. In general, for an $N$-class classification task using a Scurfield-type decision rule with $N - 1$ degrees of freedom (the generalization to $N$ classes of (17)-(19)), one can show that a set of $(N - 1)^2$ hypersurfaces with $N - 1$ degrees of freedom in $N$-dimensional ROC spaces is necessary to fully characterize the observer’s performance, although the interpretation of those hypersurfaces is not necessarily as straightforward or elegant as that provided for the $N! - 1$ hypersurfaces used by Scurfield.


4.  The  Chan  decision  rule 

Chan  et  al.  are  investigating  three-class  classifiers  for 
computer-aided  diagnosis  (Chan  et  al.,  2003).  Their  work  is 
motivated  by  reasoning  similar  in  principle  to  that  which 
we  independently  arrived  at  when  we  began  to  consider  this 
problem.  In  particular,  they  consider  a  clinical  situation  in 
which  observations  must  be  classified  as  malignant,  benign, 
or  normal.  The  goal  of  their  work  is  not  just  the 
psychophysical  measurement  of  the  performance  of  an 
existing  (e.g.,  human)  observer,  but  the  optimization  of  the 
performance  of  a  system  (containing  components  with 
parameters  subject  to  experimental  control,  e.g.  an  artificial 
neural  network)  to  aid  a  radiologist  or  clinician.  Thus  they 
are  free,  at  least  in  theory,  to  start  explicitly  from  an  ideal 
observer  model  in  constructing  their  decision  rule. 

In order to reduce the complexity of the ideal observer decision rule to manageable proportions, Chan et al. impose restrictions on the utilities used by their observer. In their formulation, the class we are labeling $\pi_1$ is the benign class; $\pi_2$, the normal class; and the malignant class is $\pi_3$. They further assume that the possible values of any utility $U_{i|j}$ are restricted to the interval $[0, 1]$. They then set $U_{1|1} = U_{2|2} = U_{3|3} = 1$ (i.e., correctly identifying any case has maximal utility). Furthermore, they require $U_{2|1} = U_{1|2} = 1$ and $U_{1|3} = U_{2|3} = 0$ (i.e., misidentifying a benign case as normal, or vice versa, has no significant cost reducing the utility of such a decision from the maximum, but misclassifying an actually malignant case as benign or normal has the minimum possible utility). Finally, $U_{3|1}$ and $U_{3|2}$ are assumed to have arbitrary values on the open interval $(0, 1)$ (i.e., misclassifying an actually non-malignant case as malignant will have some cost reducing the utility of such a decision from the maximum, but such a misclassification is in some sense “better” than missing an actual malignancy). It is important to note that these assumptions are arguably relevant to a reasonable model of a clinical situation, and are thus of interest beyond their superficial advantage in reducing the degrees of freedom involved in the observer’s decision rule. We will, however, only consider the latter issue in the remainder of this section.

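Substituting the values of the utilities given above into (4), we obtain decision boundary lines of the form

$$0\,\mathrm{LR}_1 + 0\,\mathrm{LR}_2 = 0 \quad \{\text{“1-vs.-2”}\}, \tag{42}$$

$$\frac{(1 - U_{3|1}) P(\pi_1)}{\gamma_0} \mathrm{LR}_1 + \frac{(1 - U_{3|2}) P(\pi_2)}{\gamma_0} \mathrm{LR}_2 = \frac{P(\pi_3)}{\gamma_0} \quad \{\text{“1-vs.-3”}\}, \tag{43}$$

$$\frac{(1 - U_{3|1}) P(\pi_1)}{\gamma_0} \mathrm{LR}_1 + \frac{(1 - U_{3|2}) P(\pi_2)}{\gamma_0} \mathrm{LR}_2 = \frac{P(\pi_3)}{\gamma_0} \quad \{\text{“2-vs.-3”}\}, \tag{44}$$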


where $\gamma_0 \equiv 1 + P(\pi_3) - U_{3|1} P(\pi_1) - U_{3|2} P(\pi_2)$. Note that, as Chan et al. point out, the “1-vs.-2” line is in fact undefined for this choice of utilities, while the “1-vs.-3” and “2-vs.-3” lines are identical. This is a general consequence of (8)-(10); if any two of these equations yield identical lines, the third line must be undefined. (Note that, strictly speaking, the utility structure employed by Chan et al. is excluded from our formulation by the requirement stated in (6). However, this issue, i.e., whether the ideal observer’s performance should be considered to include such limiting cases, is largely a definitional, rather than a fundamental, issue, because (6) could just as readily have been formulated as a non-negativity constraint, rather than a strict inequality as we have chosen.)
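The effect of this utility structure on the decision boundary lines can be reproduced from the definition in (4). The sketch below is illustrative only: the helper name `gamma`, the values chosen for $U_{3|1}$ and $U_{3|2}$, and the prior probabilities are hypothetical, and the line coefficients are left unnormalized (not divided by $\gamma_0$).

```python
from fractions import Fraction as F

def gamma(i, j, k, U, prior):
    """Equation (4): gamma_ijk = (U[i|k] - U[j|k]) * P(pi_k), classes 1-based."""
    return (U[i - 1][k - 1] - U[j - 1][k - 1]) * prior[k - 1]

# U[d - 1][t - 1] is the utility of deciding class d when the truth is class t.
u31, u32 = F(1, 2), F(1, 4)            # hypothetical values in (0, 1)
U = [[1, 1, 0],                        # decide benign (pi_1)
     [1, 1, 0],                        # decide normal (pi_2)
     [u31, u32, 1]]                    # decide malignant (pi_3)
prior = [F(3, 10), F(1, 2), F(1, 5)]   # hypothetical P(pi_1), P(pi_2), P(pi_3)

# g[i, j] holds gamma_iji, e.g. g[1, 2] is gamma_121.
g = {(i, j): gamma(i, j, i, U, prior)
     for i in (1, 2, 3) for j in (1, 2, 3) if i != j}
g0 = sum(g.values())

# Coefficients (a, b, c) of a*LR1 + b*LR2 = c, following (8)-(10).
line_1v2 = (g[1, 2], -g[2, 1], g[3, 1] - g[3, 2])
line_1v3 = (g[1, 3], g[2, 3] - g[2, 1], g[3, 1])
line_2v3 = (g[1, 3] - g[1, 2], g[2, 3], g[3, 2])
```

With these utilities every coefficient of the “1-vs.-2” line vanishes, while the other two lines coincide, which is the behavior described in the text.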

The decision rule considered by Chan et al. is illustrated in Fig. 6. It can be argued that, in a sense, the output of this classifier belongs to only two classes, malignant and non-malignant; in particular, because (42) is undefined, this observer will never unequivocally decide $\mathbf{d} = \pi_1$ (benign) or $\pi_2$ (normal). In fact, if $U_{3|1} = U_{3|2}$, the observer’s performance is identical with that of a two-class ideal observer which distinguishes between the malignant and non-malignant (benign plus normal) classes. However, in the more general case in which $U_{3|1} \neq U_{3|2}$, the observer considered by Chan et al. is able to achieve ROC operating points not accessible by the two-class ideal observer. (That is, the three-class ideal observer can achieve points below the two-class ideal observer’s ROC curve in a two-class ROC space, or, equivalently, points off the curve representing the two-class ideal observer’s performance plotted in a three-class ROC space.) Intuitively, their observer makes decisions based on the three distribution functions of the observational data, even though the observer’s output consists of only two possible responses.

Chan et al. evaluate the performance of their observer by plotting $P(\mathbf{d} = \pi_3 | \mathbf{t} = \pi_3)$ as a function of $P(\mathbf{d} = \pi_3 | \mathbf{t} = \pi_1)$ and $P(\mathbf{d} = \pi_3 | \mathbf{t} = \pi_2)$. Note that this single two-dimensional surface is sufficient to completely characterize the tradeoffs among the conditional classification probabilities of their observer. This is because, as just stated, the observer’s output consists of only two possible responses, and thus we have only six classification probabilities $P(\mathbf{d} = \pi_i | \mathbf{t} = \pi_j)$ rather than the nine expected in a three-class classification task. These six conditional probabilities are still constrained by three equations, however:

$$P(\mathbf{d} = \pi_3 | \mathbf{t} = \pi_1) + P(\mathbf{d} = \bar{\pi}_3 | \mathbf{t} = \pi_1) = 1, \tag{45}$$

$$P(\mathbf{d} = \pi_3 | \mathbf{t} = \pi_2) + P(\mathbf{d} = \bar{\pi}_3 | \mathbf{t} = \pi_2) = 1, \tag{46}$$

$$P(\mathbf{d} = \pi_3 | \mathbf{t} = \pi_3) + P(\mathbf{d} = \bar{\pi}_3 | \mathbf{t} = \pi_3) = 1, \tag{47}$$

where the expression $\mathbf{d} = \bar{\pi}_3$ indicates that the observer decides that the observation does not belong to class $\pi_3$. These constraint equations allow us to eliminate three of the six conditional probabilities, leaving a single ROC surface with two degrees of freedom in a three-dimensional ROC space.

Fig. 6. The decision rule investigated by Chan et al., which is a special case of the ideal observer decision rule with $\gamma_{121} = \gamma_{212} = 0$, $\gamma_{131} = (1 - U_{3|1}) P(\pi_1)/\gamma_0$, $\gamma_{232} = (1 - U_{3|2}) P(\pi_2)/\gamma_0$, and $\gamma_{313} = \gamma_{323} = P(\pi_3)/\gamma_0$; here $\gamma_0 \equiv 1 + P(\pi_3) - U_{3|1} P(\pi_1) - U_{3|2} P(\pi_2)$. Observations in the unlabeled region are decided “not $\pi_3$”, i.e., either “$\pi_1$” or “$\pi_2$”. The intercepts $\gamma_1$ and $\gamma_2$ are $P(\pi_3)/[(1 - U_{3|1}) P(\pi_1)]$ and $P(\pi_3)/[(1 - U_{3|2}) P(\pi_2)]$, respectively.

5.  The  Mossman  decision  rule 

Mossman (1999) investigates a decision rule applied to a set of three decision variables $y_1$, $y_2$, and $y_3$, subject to the constraint

$$y_1 + y_2 + y_3 = 1, \tag{48}$$

as well as $0 \le y_i \le 1$ $\{1 \le i \le 3\}$. This is consistent with the constraint on the a posteriori class probabilities, $P(\pi_1|x) + P(\pi_2|x) + P(\pi_3|x) = 1$; these quantities are known to be directly related to the likelihood ratio ideal observer decision variables (Edwards, Lan, Metz, Giger, & Nishikawa, 2004a; Kupinski, Edwards, Giger, & Metz, 2001). Mossman does not explicitly require, however, that the decision variables in (48) be the a posteriori class probabilities (e.g., they may be noisy estimates of these quantities).

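The decision rule considered by Mossman, which depends on two decision parameters $\gamma_1$ and $\gamma_2$, is

$$\text{decide } \mathbf{d} = \pi_1 \text{ iff } y_2 - y_1 \le \gamma_2 \text{ and } y_3 < \gamma_1, \tag{49}$$

$$\text{decide } \mathbf{d} = \pi_2 \text{ iff } y_2 - y_1 > \gamma_2 \text{ and } y_3 < \gamma_1, \tag{50}$$

$$\text{decide } \mathbf{d} = \pi_3 \text{ iff } y_3 \ge \gamma_1, \tag{51}$$

where $0 \le \gamma_1 \le 1$ and $-1 \le \gamma_2 \le 1$. From these relations, and given the relation $y_3 = 1 - y_1 - y_2$ from (48), one can define the decision boundary lines

$$y_1 - y_2 = -\gamma_2 \quad \{\text{“1-vs.-2”}\}, \tag{52}$$

$$y_1 + y_2 = 1 - \gamma_1 \quad \{\text{“1-vs.-3”}\}, \tag{53}$$

$$y_1 + y_2 = 1 - \gamma_1 \quad \{\text{“2-vs.-3”}\}. \tag{54}$$

This decision rule is illustrated in Fig. 7. Note that, similar to the Chan et al. decision rule, the “1-vs.-3” and “2-vs.-3” decision boundary lines are identical.

We now consider a special case of the Mossman decision rule in which $y_1 = P(\pi_1|x)$, $y_2 = P(\pi_2|x)$, and $y_3 = P(\pi_3|x)$ for some observational data vector $x$. As in Section 3, we make the substitution in (35); this allows us to rewrite (52)-(54) as

$$(1 + \gamma_2) \frac{P(\pi_1)}{P(\pi_3)} \mathrm{LR}_1 - (1 - \gamma_2) \frac{P(\pi_2)}{P(\pi_3)} \mathrm{LR}_2 = -\gamma_2 \quad \{\text{“1-vs.-2”}\}, \tag{55}$$

$$\gamma_1 \frac{P(\pi_1)}{P(\pi_3)} \mathrm{LR}_1 + \gamma_1 \frac{P(\pi_2)}{P(\pi_3)} \mathrm{LR}_2 = 1 - \gamma_1 \quad \{\text{“1-vs.-3”}\}, \tag{56}$$

$$\gamma_1 \frac{P(\pi_1)}{P(\pi_3)} \mathrm{LR}_1 + \gamma_1 \frac{P(\pi_2)}{P(\pi_3)} \mathrm{LR}_2 = 1 - \gamma_1 \quad \{\text{“2-vs.-3”}\}. \tag{57}$$

This version of the decision rule is illustrated in Fig. 8.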
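The sequential thresholding structure of the Mossman rule in (49)-(51) can be sketched in a few lines of Python. The function name is hypothetical, and the handling of ties (non-strict inequality for the class 3 threshold and for the class 1 comparison) is an assumption made so that every observation receives exactly one decision.

```python
def mossman_decide(y1, y2, y3, gamma1, gamma2):
    """Return 1, 2, or 3 following (49)-(51); y1 + y2 + y3 is assumed to equal 1."""
    # First threshold: a single cut on y3 separates class 3 from the rest.
    if y3 >= gamma1:
        return 3
    # Second threshold: a cut on the difference y2 - y1 separates classes 1 and 2.
    return 1 if y2 - y1 <= gamma2 else 2
```

Written this way, it is apparent that the same cut on $y_3$ serves as both the “1-vs.-3” and the “2-vs.-3” boundary, which is why those two lines coincide.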

Although  the  Mossman  decision  rule  for  this  choice  of 
decision  variables  appears  similar  in  form  to  the  ideal 
observer  decision  rule,  recall  from  Section  4  that  if  two  of 
the  decision  boundary  line  equations  are  identical,  the  third 
must  yield  a  line  identical  to  the  first  two  or  be  undefined. 
Another  way  to  see  this  is  to  note  that  the  coefficients  of 
(10)  are  differences  of  the  corresponding  coefficients  of  (8) 
and  (9).  If  the  coefficients  of  (9)  and  (10)  are  identical,  it 
must  be  the  case  that  the  coefficients  of  (8)  are  all  zero.  For 
the Mossman decision rule, this would require $1 + \gamma_2 = 0$, $1 - \gamma_2 = 0$, and $\gamma_2 = 0$ simultaneously, which is clearly impossible.

It follows that, for this particular choice of decision variables (related in a straightforward way to the ideal observer’s decision variables), the decision rule considered by Mossman cannot represent possible ideal observer performance for any choice of the utilities $U_{i|j}$ in (1) and (2). (One can construct probability density functions such that the Mossman observer’s behavior for a particular choice of decision criteria ($\gamma_1$ and $\gamma_2$ in (49)-(51)) corresponds to ideal observer behavior at a particular operating point. However, we do not at present have any reason to believe that this result can be generalized to arbitrary probability density functions or to arbitrary choices of decision criteria for a given choice of probability density functions.)

Fig. 7. Decision rule investigated by Mossman, for the decision parameters $\gamma_1$ and $\gamma_2$, shown in the a posteriori class probability space.

Fig. 8. Decision rule investigated by Mossman, for the decision parameters $\gamma_1$ and $\gamma_2$, shown in likelihood ratio space.

Mossman proposed that the ROC surface obtained by plotting $P(\mathbf{d} = \pi_3 | \mathbf{t} = \pi_3)$ as a function of $P(\mathbf{d} = \pi_1 | \mathbf{t} = \pi_1)$ and $P(\mathbf{d} = \pi_2 | \mathbf{t} = \pi_2)$ be used to evaluate the performance of the observer. Although this surface is clearly well-defined (the Mossman decision rule has two degrees of freedom, namely the parameters $\gamma_1$ and $\gamma_2$), it follows from the discussion at the end of Section 3 that four such surfaces in three-dimensional ROC spaces are needed to completely characterize the tradeoffs among the observer’s conditional classification probabilities.

6.  Discussion  and  conclusions 

We  examined  three  decision  rules  proposed  recently  for 
three-class  classification  tasks  by  different  researchers.  The 
basis  for  our  evaluation  was  ideal  observer  decision  theory, 
primarily  because  our  own  interest  in  the  three-class 
classification  task  is  its  possible  application  to  CAD.  A 
major  goal  in  the  development  of  a  computerized  scheme 
for  CAD  is  the  optimization  of  the  performance  of  that 
scheme,  in  order  to  provide  the  maximum  benefit  to 
clinicians  and  thus  to  their  patients.  It  should  thus  be  kept 
clearly  in  mind  that  the  ideal  observer  framework  may  not 
be  as  relevant,  for  example,  to  work  which  is  motivated  by 
purely psychophysical considerations (Mossman, 1999; Scurfield, 1996, 1998), i.e., where the goal is to estimate the properties of an existing observer.

That  being  said,  the  three-class  classification  task  is 
difficult  enough  that  it  is  perhaps  worth  making  any 
attempt  to  analyze,  from  a  single  point  of  view,  the  work  of 
the  relatively  few  researchers  investigating  this  problem, 
even  in  cases  where  that  point  of  view  is  not  necessarily 
relevant  to  the  underlying  motivations  for  that  work.  We 
feel  the  insights  we  have  gained  from  the  analysis  of 
various  decision  rules  presented  here  should  provide  at 
least  some  justification  for  that  claim. 

In particular, Scurfield points out (Scurfield, 1998) that his proposed decision rule is in fact an ideal observer decision rule for a single ideal observer operating point, namely the observer which maximizes the probability of any correct response (or “percent correct” or $P_C$). We were able to show that, under various assumptions, a larger set of such correspondences between the Scurfield observer and the ideal observer exists.


Chan et al. (2003) are working on the application of three-class classification to CAD, and thus explicitly take the ideal observer as the starting point in the development of their decision rule. Although this rendered our analysis of that decision rule in terms of ideal observer decision theory largely trivial, their decision rule merits attention as an example of a situation in which the ideal observer is indeed making use of information from the three classes of observations (i.e., its behavior is demonstrably different from that of a two-class ideal observer), while only producing two different responses for those observations. In two-class classification, the only corresponding examples are trivial: either the observer always calls observations positive (achieving an operating point of (FPF = 1, TPF = 1), where FPF is the false-positive fraction and TPF the true-positive fraction) or always calls them negative (FPF = 0, TPF = 0).

Finally, we showed that, given a particular and obvious choice of ideal-observer-related decision variables, the decision rule proposed by Mossman (1999) does not correspond to ideal observer behavior for any possible values of the observer’s utilities. However, we note that the structure of the Mossman decision rule (a simple sequence of thresholds on single decision variables) may indeed serve as a reasonable model for human observer performance in certain situations, e.g., differential diagnosis. That such a decision rule fails to be an ideal observer decision rule may be considered surprising, given the properties the Mossman decision rule shares with that of Chan et al., in particular the identity of two out of the three decision boundary lines. The reasons why one decision rule can be said to correspond to ideal observer behavior, while a rule similar in structure does not when used with a particular and obvious choice of decision variables, are connected to fundamental constraints on the ideal observer’s behavior; given the inherent complexities of the three-class classification task, it is easy for such subtleties to be overwhelmed by other details. A close comparison of two possible three-class classification decision rules can thus provide an immediate and intuitive understanding of such properties, even though a complete and fully general solution to the three-class classification
problem remains elusive.

Abstract: We are attempting to develop expressions for the coordinates of points on the three-class ideal observer’s receiver operating characteristic (ROC) hypersurface as functions of the set of decision criteria used by the ideal observer. This is considerably more difficult than in the two-class classification task, because the conditional probabilities in question are not simply related to the cumulative distribution functions of the decision variables, and because the slopes and intercepts of the decision boundary lines are not independent; given the locations of two of the lines, the location of the third will be constrained depending on the other two. In this paper, we attempt to characterize those constraining relationships among the three-class ideal observer’s decision boundary lines. As a result, we show that the relationship between the decision criteria and the misclassification probabilities is not one-to-one, as it is for the two-class ideal observer.

Index Terms: Ideal observers, ROC analysis, three-class classification.


I.  Introduction 

Receiver operating characteristic (ROC) analysis is the accepted methodology for analyzing the performance of a two-class classifier [1], in particular for medical decision-making tasks in which a patient is diagnosed as having or not having a particular condition based on features of a medical image [2]. In judging the performance of an observer measured via ROC analysis, the standard for comparison is the so-called ideal observer, that observer which outperforms any other possible observer given the statistical variability of the observational data being classified [1], [3]. Although the general form of the ideal observer in a classification task with three or more classes has been known for some time [3], the considerable complexities inherent to this model compared to the two-class classification task have hampered the development of extensions of ROC analysis which are both fully general and practically useful. (Several researchers have recently proposed restricted observer models or restricted evaluation methods [4]-[7].)

Despite these difficulties, research continues in this area because the advantages to be gained from a three-class classifier and appropriate evaluation methodology are considerable. In our own case, we seek to combine existing computer-aided diagnosis (CAD) schemes for detecting [8]-[12] mammographic mass lesions and classifying [13]-[17] them as malignant or benign. The combined scheme would serve as a fully automated classifier (the existing classifier requires initial manual identification of lesions by a radiologist), potentially allowing radiologists to reduce their false-positive biopsy rate without reducing their sensitivity for detection of malignancies. Simply concatenating the two types of scheme in a two-stage classifier would be inadequate, because the output of the detection scheme will necessarily include false-positive (FP) computer detections in addition to the malignant and benign lesions to be classified. These FP computer detections correspond to objects which were by design not included in the training sample of the classification scheme, because they are not members of the data population (benign and malignant mass breast lesions) for which the classification scheme was created. It is clear then that the detection scheme’s output cannot be used unmodified as the input to the classification scheme.

Our initial efforts toward the goal of developing a true
three-class classifier have been more theoretical than practical
so far. We have shown that, just as the two-class ideal observer
achieves the optimal two-class ROC curve for a given task,
the $N$-class ideal observer achieves the optimal $N$-class ROC
hypersurface [18]. (Note that the ideal observer is formally
defined as that which minimizes the expected Bayes risk [3],
and not in terms of classification performance, making this
a nontrivial observation in both cases.) More soberingly, we
found recently that an obvious generalization of the well-known
performance metric, the area under the ROC curve (AUC), is
not a useful performance metric in a classification task with
three or more classes [19].

At present we are attempting to develop expressions for the
coordinates of points on the three-class ideal observer's ROC
hypersurface (the conditional probabilities for misclassifying
observations [18], [20], [21]) as functions of the set of decision
criteria used by the ideal observer. This is considerably more
difficult than in the two-class classification task for two reasons.
First, the conditional probabilities in question are not simply related
to the cumulative distribution functions (cdfs) of the decision
variables, but are integrals of those variables over domains
determined by three decision boundary lines [3]. Second, the
slopes and intercepts of the decision boundary lines are not independent;
given the locations of two of the lines, we have found
recently that the location of the third will be constrained depending
on the other two.

In this paper, we attempt to characterize the constraining relationships
just mentioned among the three-class ideal observer's
decision boundary lines. Although this paper is admittedly still
removed from image analysis per se, we hope it may prove of
interest to the CAD community and ultimately of relevance to a
wide variety of medical image analysis tasks. In the next section
we briefly review the structure of the three-class ideal observer
and the notation we have been using to characterize it [18]. In
Section III, we show that for a given location (slope and $y$-intercept)
of the decision boundary line separating the first and third
classes, the location of one of the remaining two lines is constrained
in a particular way based on the location of the other.
These results are discussed in Section IV. Given the arbitrariness
of the labels applied to the three classes (i.e., which classes
are considered first, second, or third), one would expect the selection
of the fixed line in Section III to be similarly arbitrary,
and indeed in Appendices A and B we show that corresponding
and consistent results are obtained if one takes the location of
the decision boundary line separating the second and third, or
first and second, classes, respectively, to be given.

II.  The  Three-Class  Ideal  Observer 

In [18], we showed that an $N$-class ideal observer makes decisions
by partitioning a likelihood ratio decision variable space,
where the boundaries of the partitions are given by hyperplanes

$$\text{decide } \mathbf{d} = \pi_i \text{ iff } \sum_{k=1}^{N-1} (U_{i|k} - U_{j|k})\, P(\mathbf{t} = \pi_k)\, \mathrm{LR}_k > (U_{j|N} - U_{i|N})\, P(\mathbf{t} = \pi_N) \quad \{j < i\} \tag{1}$$

$$\text{and } \sum_{k=1}^{N-1} (U_{i|k} - U_{j|k})\, P(\mathbf{t} = \pi_k)\, \mathrm{LR}_k > (U_{j|N} - U_{i|N})\, P(\mathbf{t} = \pi_N) \quad \{j > i\}. \tag{2}$$

Here, $U_{i|j}$ is the utility of deciding an observation is from class
$\pi_i$ given that it is actually from class $\pi_j$; $P(\mathbf{t} = \pi_k)$ is the a priori
probability that an observation is drawn from class $\pi_k$; and $\mathrm{LR}_k$
is the $k$th likelihood ratio, defined by the ratio $p(\mathbf{x}|\pi_k)/p(\mathbf{x}|\pi_N)$
of the probability density functions of the observational data $\mathbf{x}$.
(We use boldface type to denote random variables.) The partitioning
is determined by the parameters

$$\gamma_{ijk} \equiv (U_{i|k} - U_{j|k})\, P(\mathbf{t} = \pi_k) \tag{3}$$

with $i$, $j$, and $k$ varying from 1 to $N$, and $j \neq i$. Note that these
parameters are not independent, however, because

$$\gamma_{ijk} = \gamma_{kjk} - \gamma_{kik}. \tag{4}$$

We can impose the reasonable condition that the utility for
correctly classifying an observation from a given class should be
greater than any utility for incorrectly classifying an observation
from the same class, i.e., $U_{i|i} > U_{j|i}$ $\{i \neq j\}$. This gives, for
$j \neq i$,

$$\gamma_{iji} > 0 \tag{5}$$

leaving $N(N - 1)$ positive parameters (the rest are derivable
from (4)).

Finally, note that the hyperplanes represented by (1) and (2)
are unchanged if we multiply all of these equations by a single
positive scalar. This leaves us with $N^2 - N - 1$
degrees of freedom, as expected.
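The rule in (1) and (2) is equivalent to choosing the class whose expected utility, written in terms of likelihood ratios with $\mathrm{LR}_N = 1$, is largest. The Python sketch below illustrates this under stated assumptions: the function name `ideal_decision`, the 0-based nested-list layout `utilities[i][k]` for $U_{i|k}$, and the tie-breaking convention (lowest class index wins) are choices made for this illustration and are not taken from [18].

```python
def ideal_decision(utilities, priors, likelihood_ratios):
    """Return the 1-based class index chosen by a maximum-expected-utility rule.

    utilities[i][k]   : U_{i|k}, utility of deciding class i+1 when the truth is class k+1
    priors[k]         : P(t = pi_{k+1})
    likelihood_ratios : [LR_1, ..., LR_{N-1}]; LR_N = 1 is appended internally
    """
    n = len(priors)
    lrs = list(likelihood_ratios) + [1.0]
    if len(lrs) != n or len(utilities) != n:
        raise ValueError("need N priors, N-1 likelihood ratios, and an N x N utility table")

    def score(i):
        # Expected utility of deciding class i+1, divided by p(x | pi_N).
        return sum(utilities[i][k] * priors[k] * lrs[k] for k in range(n))

    # max() returns the first maximiser, so ties go to the lowest index (an assumption).
    return 1 + max(range(n), key=score)


# Hypothetical example: unit utility for correct decisions, zero otherwise.
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
decision = ideal_decision(identity, [0.3, 0.3, 0.4], [2.0, 0.5])
# The three scores are 0.6, 0.15 and 0.4, so class 1 is chosen.
```

Comparing `score(i)` with `score(j)` for a pair of classes reproduces the inequality in (1) and (2) for that pair, which is why the pairwise hyperplanes and the single maximization describe the same partition.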

The behavior of a three-class ideal observer is completely
determined by the three decision boundary lines

$$\gamma_{121}\mathrm{LR}_1 - \gamma_{212}\mathrm{LR}_2 = \gamma_{313} - \gamma_{323} \tag{6}$$

$$\gamma_{131}\mathrm{LR}_1 + (\gamma_{232} - \gamma_{212})\mathrm{LR}_2 = \gamma_{313} \tag{7}$$

$$(\gamma_{131} - \gamma_{121})\mathrm{LR}_1 + \gamma_{232}\mathrm{LR}_2 = \gamma_{323} \tag{8}$$

which we call, respectively, the “1-vs-2” line, the “1-vs-3” line,
and the “2-vs-3” line. Note that if any two of these lines intersect,
the third line must also share this intersection point. We
also emphasize the simple interpretation, from (3), of each of the
$\gamma_{iji}$ parameters appearing in these decision boundary line equations
as the difference in utilities between a “correct” and one
particular “incorrect” decision (scaled by the a priori probability
of the true class in question); and of each difference in the $\gamma_{iji}$
parameters as a difference in utilities between two possible “incorrect”
decisions [again scaled by the a priori probability of the
true class in question; e.g., $\gamma_{313} - \gamma_{323} = (U_{2|3} - U_{1|3})\, P(\mathbf{t} = \pi_3)$].

From the conditions on the $\gamma_{iji}$ parameters in (5), we can
readily derive conditions on the decision boundaries themselves.
If we denote the slope of the “$i$-vs-$j$” line by $m_{ij}$, its $y$-intercept
by $b_{ij}$, and its $x$-intercept by $x_{ij}$, we have

$$m_{12} = \frac{\gamma_{121}}{\gamma_{212}} > 0 \tag{9}$$

$$x_{13} = \frac{\gamma_{313}}{\gamma_{131}} > 0 \tag{10}$$

$$b_{23} = \frac{\gamma_{323}}{\gamma_{232}} > 0. \tag{11}$$

These are the three conditions stated in [22].
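As a concrete check of the statement that the three lines (6)-(8) share any intersection point, the following Python sketch builds the line coefficients from the six $\gamma_{iji}$ parameters and solves for the intersection. The function names `boundary_lines` and `intersection`, and the numerical parameter values, are hypothetical and chosen only for illustration; exact rational arithmetic is used so that the equality test is meaningful.

```python
from fractions import Fraction


def boundary_lines(g121, g131, g212, g232, g313, g323):
    """Coefficients (a, b, c) of a*LR1 + b*LR2 = c for the lines (6), (7), (8)."""
    return {
        "1-vs-2": (g121, -g212, g313 - g323),
        "1-vs-3": (g131, g232 - g212, g313),
        "2-vs-3": (g131 - g121, g232, g323),
    }


def intersection(line_p, line_q):
    """Intersection of two lines by Cramer's rule; None if they are parallel."""
    a1, b1, c1 = line_p
    a2, b2, c2 = line_q
    det = a1 * b2 - a2 * b1
    if det == 0:
        return None
    return ((c1 * b2 - c2 * b1) / det, (a1 * c2 - a2 * c1) / det)


# Made-up positive parameters, as required by (5).
params = [Fraction(v) for v in (2, 3, 1, 4, 5, 2)]
lines = boundary_lines(*params)
point = intersection(lines["1-vs-2"], lines["1-vs-3"])   # (14/9, 1/9)
a, b, c = lines["2-vs-3"]
on_third_line = a * point[0] + b * point[1] == c          # True for these values
```

The common intersection follows directly from the fact that (6) is the difference of (7) and (8), so any point satisfying two of the equations satisfies the third.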


III. Restrictions Determined by the Parameters of the “1-vs-3” Line

Constraints on the decision boundaries, in addition to those
given in (9)-(11), can be obtained by considering the two cases
$\gamma_{232} - \gamma_{212} > 0$ and $\gamma_{232} - \gamma_{212} < 0$. In the first case (i.e.,
$\gamma_{232} > \gamma_{212}$, or $U_{1|2} > U_{3|2}$), we have

$$m_{13} = \frac{-\gamma_{131}}{\gamma_{232} - \gamma_{212}} < 0 \tag{12}$$

$$b_{13} = \frac{\gamma_{313}}{\gamma_{232} - \gamma_{212}} > 0. \tag{13}$$

We also have

$$m_{23} = \frac{-(\gamma_{131} - \gamma_{121})}{\gamma_{232}} = \frac{(\gamma_{232} - \gamma_{212})\, m_{13} + \gamma_{212}\, m_{12}}{\gamma_{232}} = \left(1 - \frac{\gamma_{212}}{\gamma_{232}}\right) m_{13} + \frac{\gamma_{212}}{\gamma_{232}}\, m_{12}. \tag{14}$$

This is a weighted sum of the slopes $m_{12}$ and $m_{13}$, where the
weights are positive and sum to one. Since we must have $m_{13} < m_{12}$
from (9) and (12), it must therefore be the case that

$$m_{13} < m_{23} < m_{12}. \tag{15}$$

Furthermore

$$b_{23} = \frac{\gamma_{323}}{\gamma_{232}} = \frac{\gamma_{313} - (\gamma_{313} - \gamma_{323})}{\gamma_{232}} = \frac{(\gamma_{232} - \gamma_{212})\, b_{13} + \gamma_{212}\, b_{12}}{\gamma_{232}} = \left(1 - \frac{\gamma_{212}}{\gamma_{232}}\right) b_{13} + \frac{\gamma_{212}}{\gamma_{232}}\, b_{12}. \tag{16}$$

This is a weighted sum of the $y$-intercepts $b_{12}$ and $b_{13}$, where the
weights are positive and sum to one; thus, in addition to (15), we
have the condition

$$\min(b_{12}, b_{13}) < b_{23} < \max(b_{12}, b_{13}). \tag{17}$$

If $b_{12} < 0$, then (17) immediately reduces to $b_{12} < b_{23} < b_{13}$
(by (13), we are considering a special case in which $b_{13} > 0$).
This is illustrated in Fig. 1 for the slightly different situations
$x_{12} < x_{13}$ and $x_{12} > x_{13}$. If, on the other hand, $b_{12} > 0$, then
(15) and (17) together imply two possible situations, depending
on whether $b_{12} < b_{13}$ or $b_{12} > b_{13}$. These possibilities are
illustrated in Fig. 2.

Fig. 1. Example ideal observer decision rules for the case $\gamma_{232} - \gamma_{212} > 0$
(implying $m_{13} < 0$ and $b_{13} > 0$) and $b_{12} < 0$. In (a), $x_{12} < x_{13}$, and
the “2-vs-3” line can lie anywhere between the two dashed lines shown (the
region between the lower dashed and dotted lines is excluded because $b_{23} > 0$);
observations in the unlabeled region above this line will be decided “$\pi_2$,” and
those below this line will be decided “$\pi_3$.” In (b), $x_{12} \geq x_{13}$ and the “2-vs-3”
line can lie anywhere in the unlabeled region (provided it shares the intersection
point of the “1-vs-2” and “1-vs-3” lines shown); observations above this line will
be decided “$\pi_2$,” and those below this line will be decided “$\pi_3$.”

Fig. 2. Example ideal observer decision rules for the case $\gamma_{232} - \gamma_{212} > 0$
(implying $m_{13} < 0$ and $b_{13} > 0$) and $b_{12} > 0$. In (a), $b_{12} < b_{13}$, and the
“2-vs-3” line can lie anywhere in the unlabeled region; observations above this
line will be decided “$\pi_2$,” and those below this line will be decided “$\pi_3$.” In
(b), $b_{12} > b_{13}$ and the “2-vs-3” line can lie anywhere between the “1-vs-2” and
“1-vs-3” lines (provided it shares their intersection point); note that observations
in this region will be decided “$\pi_1$” regardless of the position of this line.

We now consider the case $\gamma_{232} - \gamma_{212} < 0$ (i.e., $\gamma_{232} < \gamma_{212}$,
or $U_{1|2} < U_{3|2}$), which yields

$$m_{13} = \frac{-\gamma_{131}}{\gamma_{232} - \gamma_{212}} > 0 \tag{18}$$

$$b_{13} = \frac{\gamma_{313}}{\gamma_{232} - \gamma_{212}} < 0. \tag{19}$$

We now have

$$m_{12} = \frac{\gamma_{121}}{\gamma_{212}} = \frac{\gamma_{131} - (\gamma_{131} - \gamma_{121})}{\gamma_{212}} = \frac{-(\gamma_{232} - \gamma_{212})\, m_{13} + \gamma_{232}\, m_{23}}{\gamma_{212}} = \left(1 - \frac{\gamma_{232}}{\gamma_{212}}\right) m_{13} + \frac{\gamma_{232}}{\gamma_{212}}\, m_{23}. \tag{20}$$

This is again a weighted sum in which the weights are positive
and sum to one, giving

$$\min(m_{13}, m_{23}) < m_{12} < \max(m_{13}, m_{23}). \tag{21}$$

Furthermore

$$b_{12} = \frac{\gamma_{313} - \gamma_{323}}{-\gamma_{212}} = \frac{-\gamma_{313} + \gamma_{323}}{\gamma_{212}} = \frac{-(\gamma_{232} - \gamma_{212})\, b_{13} + \gamma_{232}\, b_{23}}{\gamma_{212}} = \left(1 - \frac{\gamma_{232}}{\gamma_{212}}\right) b_{13} + \frac{\gamma_{232}}{\gamma_{212}}\, b_{23}. \tag{22}$$

This is a weighted sum of the $y$-intercepts $b_{13}$ and $b_{23}$, where the
weights are positive and sum to one; thus, in addition to (21), we
have the condition

$$b_{13} < b_{12} < b_{23} \tag{23}$$

since $b_{13} < b_{23}$ by (11) and (19).

If $m_{23} < 0$, then (21) immediately reduces to $m_{23} < m_{12} < m_{13}$
(by (18), we are considering a special case in which $m_{13} > 0$).
This is illustrated in Fig. 3 for the slightly different situations
$x_{13} < x_{23}$ and $x_{13} > x_{23}$. If, on the other hand, $m_{23} > 0$, then
(21) and (23) together imply two possible situations, depending
on whether $m_{23} < m_{13}$ or $m_{23} > m_{13}$. These possibilities are
illustrated in Fig. 4.

Fig. 3. Example ideal observer decision rules for the case $\gamma_{232} - \gamma_{212} < 0$
(implying $m_{13} > 0$ and $b_{13} < 0$) and $m_{23} < 0$. In (a), $x_{13} < x_{23}$, and the
“1-vs-2” line can lie anywhere between the two dashed lines shown (the region
between the lower dashed and dotted lines is excluded because $m_{12} > 0$);
observations in the unlabeled region above this line will be decided “$\pi_2$,” and
those below this line will be decided “$\pi_1$.” In (b), $x_{13} > x_{23}$ and the “1-vs-2”
line can lie anywhere in the unlabeled region (provided it shares the intersection
point of the “1-vs-3” and “2-vs-3” lines shown); observations above this line will
be decided “$\pi_2$,” and those below this line will be decided “$\pi_1$.”

Fig. 4. Example ideal observer decision rules for the case $\gamma_{232} - \gamma_{212} < 0$
(implying $m_{13} > 0$ and $b_{13} < 0$) and $m_{23} > 0$. In (a), $m_{23} < m_{13}$, and the
“1-vs-2” line can lie anywhere in the unlabeled region; observations above this
line will be decided “$\pi_2$,” and those below this line will be decided “$\pi_1$.” In (b),
$m_{23} > m_{13}$, and the “1-vs-2” line can lie anywhere between the “1-vs-3” and
“2-vs-3” lines (provided it shares their intersection point); note that observations
in this region will be decided “$\pi_3$” regardless of the position of this line.

One may of course ask what happens when $\gamma_{232} - \gamma_{212} = 0$
(i.e., $\gamma_{232} = \gamma_{212}$, or $U_{1|2} = U_{3|2}$). In this case, both $m_{13}$ and
$b_{13}$ are infinite. Furthermore

$$m_{23} = \frac{-\gamma_{131}}{\gamma_{232}} + m_{12} < m_{12} \tag{24}$$

and

$$b_{12} = \frac{\gamma_{323} - \gamma_{313}}{\gamma_{212}} = \frac{\gamma_{323}}{\gamma_{232}} - \frac{\gamma_{313}}{\gamma_{212}} = b_{23} + \frac{-\gamma_{313}}{\gamma_{212}} < b_{23}. \tag{25}$$

Together, (24) and (25) can be considered either a special case
of the inequalities (15) and (17), if we take $m_{13} = -\infty$ and
$b_{13} = +\infty$; or of the inequalities (21) and (23), if we take
$m_{13} = +\infty$ and $b_{13} = -\infty$. This situation, for the slightly
different cases $b_{12} < 0$ and $b_{12} > 0$, is illustrated in Fig. 5.

Fig. 5. Example ideal observer decision rules for the case $\gamma_{232} - \gamma_{212} = 0$
(implying $m_{13} = \mp\infty$ and $b_{13} = \pm\infty$). In (a), $b_{12} < 0$ and the “2-vs-3” line
can lie anywhere between the two dashed lines shown (the region between the
lower dashed and dotted lines is excluded because $b_{23} > 0$); observations in the
unlabeled region above this line will be decided “$\pi_2$,” and those below this line
will be decided “$\pi_3$.” In (b), $b_{12} > 0$ and the “2-vs-3” line can lie anywhere
in the unlabeled region; observations above this line will be decided “$\pi_2$,” and
those below this line will be decided “$\pi_3$.”

In this section, the possible values of the quantity $\gamma_{232} - \gamma_{212}$
were considered in order to determine properties of the ideal observer
decision boundary lines. It may be argued that the choice
of a parameter from the “1-vs-3” line, i.e., one of the three available
lines, must be an arbitrary one. In fact, we may consider
taking another parameter (or combination of parameters) from
(6)-(8), and using it to determine conditions on the properties
of the decision boundary lines as above. Given that all possible
values of the quantity $\gamma_{232} - \gamma_{212}$ were considered, it is expected
that no new conditions should be determinable (let alone conditions
inconsistent with those already determined). In fact, this
can readily be shown to be the case; however, due to the repetitive
nature of the derivations involved, these are relegated to
Appendices A and B.
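The restrictions derived in this section can be spot-checked numerically. The Python sketch below computes the slopes and $y$-intercepts of the lines (6)-(8) from a set of positive $\gamma_{iji}$ values and tests (15) and (17) in the first case, or (21) and (23) in the second. The function names `line_parameters` and `satisfies_restrictions` are hypothetical, the example values are made up, and the case $\gamma_{232} = \gamma_{212}$ is excluded. Degenerate inputs in which two of the compared quantities coincide are not treated specially.

```python
def line_parameters(g121, g131, g212, g232, g313, g323):
    """Slopes m_ij and y-intercepts b_ij of the lines (6)-(8); needs g232 != g212."""
    d = g232 - g212
    return {
        "m12": g121 / g212,
        "b12": -(g313 - g323) / g212,
        "m13": -g131 / d,
        "b13": g313 / d,
        "m23": -(g131 - g121) / g232,
        "b23": g323 / g232,
    }


def satisfies_restrictions(g121, g131, g212, g232, g313, g323):
    """True if the Section III inequalities for the applicable case all hold."""
    p = line_parameters(g121, g131, g212, g232, g313, g323)
    if g232 > g212:
        # First case: inequalities (15) and (17).
        slopes_ok = p["m13"] < p["m23"] < p["m12"]
        intercepts_ok = min(p["b12"], p["b13"]) < p["b23"] < max(p["b12"], p["b13"])
    else:
        # Second case: inequalities (21) and (23).
        slopes_ok = min(p["m13"], p["m23"]) < p["m12"] < max(p["m13"], p["m23"])
        intercepts_ok = p["b13"] < p["b12"] < p["b23"]
    return slopes_ok and intercepts_ok


# Made-up parameter sets, one per case (argument order: g121, g131, g212, g232, g313, g323).
first_case = satisfies_restrictions(2, 3, 1, 4, 5, 2)    # g232 - g212 = 3 > 0
second_case = satisfies_restrictions(2, 3, 4, 1, 5, 2)   # g232 - g212 = -3 < 0
```

For the first parameter set the slopes are $m_{13} = -1$, $m_{23} = -0.25$, $m_{12} = 2$, in agreement with (15); for the second, $m_{23} = -1$, $m_{12} = 0.5$, $m_{13} = 1$, in agreement with (21).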

IV.  Discussion  and  Conclusion 

The  repetitive  nature  of  the  algebraic  manipulations  given  in 
the  preceding  section  and  the  Appendices  should  not  be  allowed 
to  distract  from  the  fundamental  point  being  made:  given  the 
locations  of  two  of  the  decision  boundary  lines,  the  location 
of  the  third  is  not  completely  arbitrary.  That  is,  aside  from  the 
obvious [given (6)-(8)] constraint that the lines must share a
common  intersection  point,  it  can  also  be  shown  that  the  slope 
of  the  third  line  is  constrained  by  the  slopes  of  the  first  two. 

The significance of this result may be difficult to appreciate
at first glance. It is perhaps best illustrated by comparison with
the two-class classifier, for which the ROC operating point coordinates
[e.g., the true-positive fraction (TPF) and false-positive
fraction (FPF)] are determined by a single decision criterion $\gamma$,
which is free to vary without restriction throughout its domain
of definition. For the two-class ideal observer, in particular, an
observation is decided “positive” (assigned to the class $\pi_1$) if
$\mathrm{LR}_1 > \gamma$, where $\gamma$ can take on any nonnegative value. Furthermore,
the FPF and TPF are related in a very simple way to the
cdfs of $\mathrm{LR}_1$, and are thus monotonic in the decision criterion $\gamma$.
For the three-class ideal observer, this straightforward relationship
is lost; indeed, Figs. 2(b), 4(b), 7(b), 9(b), 12(b), and 14(b)
show that for certain values of four of the five decision criteria
$\gamma_{iji}$, the misclassification probabilities (i.e., the ROC operating
point coordinates) can be independent of the fifth decision criterion.
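The two-class behavior described above can be made concrete with a small Python sketch. It assumes, purely for illustration, an equal-variance Gaussian model that does not appear in the paper: the observation is drawn from N(0, 1) for the negative class and from N(`separation`, 1) for the positive class, so that $\mathrm{LR}_1 = \exp(\texttt{separation} \cdot x - \texttt{separation}^2/2)$. The names `normal_cdf`, `operating_point`, and `separation` are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def operating_point(gamma, separation=1.5):
    """(FPF, TPF) when deciding "positive" iff LR_1 > gamma, with separation > 0.

    Under the assumed Gaussian model, LR_1 > gamma is equivalent to
    x > log(gamma) / separation + separation / 2.
    """
    if gamma <= 0.0:
        return 1.0, 1.0  # LR_1 is strictly positive, so every case is called positive
    x_cut = math.log(gamma) / separation + separation / 2.0
    fpf = 1.0 - normal_cdf(x_cut)               # negative class exceeds the cut
    tpf = 1.0 - normal_cdf(x_cut - separation)  # positive class exceeds the cut
    return fpf, tpf


# Sweep the single criterion; both fractions fall as gamma grows.
sweep = [operating_point(g) for g in (0.25, 0.5, 1.0, 2.0, 4.0)]
```

Each value of $\gamma$ maps to exactly one operating point here, and distinct values give distinct points; it is this one-to-one behavior that the three-class decision criteria fail to share.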

More succinctly, the relationship between the decision criteria
and the misclassification probabilities is not one-to-one,
as it is for the two-class ideal observer. A correct formulation
of the misclassification probabilities as functions of the decision
criteria (necessary for an explicit calculation of the ideal
observer's ROC hypersurface given the decision variable probability
density functions) will require careful consideration of
this issue. Although we have shown previously that the hypervolume
under the ROC hypersurface is not a useful performance
metric in general [19], it is still the case that the ROC hypersurface
in terms of the set of misclassification probabilities (six
in the three-class classification task) is a complete description
of observer performance. We expect that a useful performance
metric, assuming one exists, will be derived in some fashion
from the ROC hypersurface. It is thus important to develop a
complete understanding of the rather complicated relationships
among the quantities involved, and we hope that this paper will
prove of some use toward this goal.

Fig. 6. Example ideal observer decision rules for the case $\gamma_{131} - \gamma_{121} > 0$
(implying $1/m_{23} < 0$ and $x_{23} > 0$) and $x_{12} < 0$. In (a), $b_{12} < b_{23}$, and
the “1-vs-3” line can lie anywhere between the two dashed lines shown (the
region between the left dashed and dotted lines is excluded because $x_{13} > 0$);
observations in the unlabeled region to the right of this line will be decided “$\pi_1$,”
and those to the left of this line will be decided “$\pi_3$.” In (b), $b_{12} > b_{23}$ and the
“1-vs-3” line can lie anywhere in the unlabeled region (provided it shares the
intersection point of the “1-vs-2” and “2-vs-3” lines shown); observations to the
right of this line will be decided “$\pi_1$,” and those to the left of this line will be
decided “$\pi_3$.”


Appendix A

Restrictions Determined by the Parameters of the “2-vs-3” Line

Consider the quantity $\gamma_{131} - \gamma_{121}$ from (8). In particular, when
$\gamma_{131} - \gamma_{121} > 0$ (i.e., $\gamma_{131} > \gamma_{121}$, or $U_{2|1} > U_{3|1}$), we have

$$\frac{1}{m_{23}} = \frac{-\gamma_{232}}{\gamma_{131} - \gamma_{121}} < 0 \tag{26}$$

$$x_{23} = \frac{\gamma_{323}}{\gamma_{131} - \gamma_{121}} > 0. \tag{27}$$

Through reasoning similar to that of Section III, we also have

$$\frac{1}{m_{23}} < \frac{1}{m_{13}} < \frac{1}{m_{12}} \tag{28}$$

and

$$\min(x_{12}, x_{23}) < x_{13} < \max(x_{12}, x_{23}). \tag{29}$$

If $x_{12} < 0$, then (29) immediately reduces to $x_{12} < x_{13} < x_{23}$
(by (27), we are considering a special case in which $x_{23} > 0$).
This is illustrated in Fig. 6 for the slightly different situations
$b_{12} < b_{23}$ and $b_{12} > b_{23}$. If, on the other hand, $x_{12} > 0$, then
(28) and (29) together imply two possible situations, depending
on whether $x_{12} < x_{23}$ or $x_{12} > x_{23}$. These possibilities are
illustrated in Fig. 7.

Fig. 7. Example ideal observer decision rules for the case $\gamma_{131} - \gamma_{121} > 0$
(implying $1/m_{23} < 0$ and $x_{23} > 0$) and $x_{12} > 0$. In (a), $x_{12} < x_{23}$, and
the “1-vs-3” line can lie anywhere in the unlabeled region; observations to the
left of this line will be decided “$\pi_1$,” and those to the right of this line will be
decided “$\pi_3$.” In (b), $x_{12} > x_{23}$ and the “1-vs-3” line can lie anywhere between
the “1-vs-2” and “2-vs-3” lines (provided it shares their intersection point); note
that observations in this region will be decided “$\pi_2$” regardless of the position
of this line.

If $\gamma_{131} - \gamma_{121} < 0$ (i.e., $\gamma_{131} < \gamma_{121}$, or $U_{2|1} < U_{3|1}$), we have

$$\frac{1}{m_{23}} = \frac{-\gamma_{232}}{\gamma_{131} - \gamma_{121}} > 0 \tag{30}$$

$$x_{23} = \frac{\gamma_{323}}{\gamma_{131} - \gamma_{121}} < 0. \tag{31}$$

One can also show

$$\min\left(\frac{1}{m_{13}}, \frac{1}{m_{23}}\right) < \frac{1}{m_{12}} < \max\left(\frac{1}{m_{13}}, \frac{1}{m_{23}}\right) \tag{32}$$

and

$$x_{23} < x_{12} < x_{13}. \tag{33}$$

If $1/m_{13} < 0$, then (32) immediately reduces to $1/m_{13} < 1/m_{12} < 1/m_{23}$
(by (30), we are considering a special case in
which $1/m_{23} > 0$). This is illustrated in Fig. 8 for the slightly
different situations $b_{23} < b_{13}$ and $b_{23} > b_{13}$. If, on the other
hand, $1/m_{13} > 0$, then (32) and (33) together imply two possible
situations, depending on whether $1/m_{13} < 1/m_{23}$ or
$1/m_{13} > 1/m_{23}$. These possibilities are illustrated in Fig. 9.

Fig. 8. Example ideal observer decision rules for the case $\gamma_{131} - \gamma_{121} < 0$
(implying $1/m_{23} > 0$ and $x_{23} < 0$) and $1/m_{13} < 0$. In (a), $b_{23} < b_{13}$,
and the “1-vs-2” line can lie anywhere between the two dashed lines shown
(the region between the vertical dashed and dotted lines is excluded because
$m_{12} > 0$ and, therefore, $1/m_{12} > 0$); observations in the unlabeled region
above this line will be decided “$\pi_2$,” and those below this line will be decided
“$\pi_1$.” In (b), $b_{23} > b_{13}$ and the “1-vs-2” line can lie anywhere in the unlabeled
region (provided it shares the intersection point of the “1-vs-3” and “2-vs-3”
lines shown); observations above this line will be decided “$\pi_2$,” and those below
this line will be decided “$\pi_1$.”

Fig. 9. Example ideal observer decision rules for the case $\gamma_{131} - \gamma_{121} < 0$
(implying $1/m_{23} > 0$ and $x_{23} < 0$) and $1/m_{13} > 0$. In (a),
$1/m_{13} < 1/m_{23}$, and the “1-vs-2” line can lie anywhere in the unlabeled
region; observations above this line will be decided “$\pi_2$,” and those below this
line will be decided “$\pi_1$.” In (b), $1/m_{13} > 1/m_{23}$ and the “1-vs-2” line can
lie anywhere between the “1-vs-3” and “2-vs-3” lines (provided it shares their
intersection point); note that observations in this region will be decided “$\pi_3$”
regardless of the position of this line.

Finally, we consider the case $\gamma_{131} - \gamma_{121} = 0$ ($\gamma_{131} = \gamma_{121}$
or $U_{2|1} = U_{3|1}$), in which both $1/m_{23}$ and $x_{23}$ are infinite. We
now have

$$\frac{1}{m_{13}} < \frac{1}{m_{12}} \tag{34}$$

and

$$x_{12} < x_{13}. \tag{35}$$

Together, (34) and (35) can be considered either a special
case of the inequalities (28) and (29), if we take $1/m_{23} = -\infty$
and $x_{23} = +\infty$; or of the inequalities (32) and (33), if we take
$1/m_{23} = +\infty$ and $x_{23} = -\infty$. This situation, for the slightly
different cases $x_{12} < 0$ and $x_{12} > 0$, is illustrated in Fig. 10.

Fig. 10. Example ideal observer decision rules for the case $\gamma_{131} - \gamma_{121} = 0$
(implying $1/m_{23} = \mp\infty$ and $x_{23} = \pm\infty$). In (a), $x_{12} < 0$, and the “1-vs-3”
line can lie anywhere between the two dashed lines shown (the region between
the leftmost dashed and dotted lines is excluded because $x_{13} > 0$); observations
in the unlabeled region to the right of this line will be decided “$\pi_1$,” and those
to the left of this line will be decided “$\pi_3$.” In (b), $x_{12} \geq 0$ and the “1-vs-3”
line can lie anywhere in the unlabeled region; observations to the right of this
line will be decided “$\pi_1$,” and those to the left of this line will be decided “$\pi_3$.”

Notice that every figure in this appendix has one or more
corresponding figures in Section III (depending on the possible
values of the undetermined decision boundary parameter being
illustrated in that figure). Specifically

Fig. 6(a)   =>  Figs. 2(a), 3(a), 5(b)
Fig. 6(b)   =>  Fig. 2(b)
Fig. 7(a)   =>  Figs. 1(a), 3(a), 5(a)
Fig. 7(b)   =>  Figs. 1(b), 3(b), 5(a)
Fig. 8(a)   =>  Figs. 1(a), 2(a)
Fig. 8(b)   =>  Fig. 2(b)
Fig. 9(a)   =>  Figs. 4(a), 5(a), 5(b)
Fig. 9(b)   =>  Fig. 4(b)
Fig. 10(a)  =>  Figs. 2(a), 4(a), 5(b), 2(b)
Fig. 10(b)  =>  Figs. 1(a), 4(a), 5(a).

That is, none of the conditions derived in this section are inconsistent
with those derived in Section III. More importantly, note
the symmetry between the corresponding equations and figures
in Section III and this appendix, if one “swaps” the labels of
classes $\pi_1$ and $\pi_2$, and additionally replaces $m_{ij}$ with $1/m_{i'j'}$,
$x_{ij}$ with $b_{i'j'}$, and $b_{ij}$ with $x_{i'j'}$ ($i' = 1$ if $i = 2$, 2 if $i = 1$, and
3 if $i = 3$; similarly for $j$). Intuitively, if one “flips” the figures
in one section about the $y = x$ line, one obtains the figures in
the other section.


Appendix B

Restrictions Determined by the Parameters of the “1-vs-2” Line

In this appendix, we consider the possible values of the quantity
$\gamma_{313} - \gamma_{323}$. As in the preceding Appendix, we expect to
obtain no conditions inconsistent with those already derived.

When $\gamma_{313} - \gamma_{323} > 0$ (i.e., $\gamma_{313} > \gamma_{323}$, or $U_{2|3} > U_{1|3}$), we
have

$$\frac{1}{b_{12}} = \frac{-\gamma_{212}}{\gamma_{313} - \gamma_{323}} < 0 \tag{36}$$

$$\frac{1}{x_{12}} = \frac{\gamma_{121}}{\gamma_{313} - \gamma_{323}} > 0. \tag{37}$$


Through reasoning similar to that of Section III, we also have

$$\frac{1}{b_{12}} < \frac{1}{b_{13}} < \frac{1}{b_{23}} \tag{38}$$

and

$$\min\left(\frac{1}{x_{23}}, \frac{1}{x_{12}}\right) < \frac{1}{x_{13}} < \max\left(\frac{1}{x_{23}}, \frac{1}{x_{12}}\right). \tag{39}$$

If $1/x_{23} < 0$, then (39) immediately reduces to $1/x_{23} < 1/x_{13} < 1/x_{12}$
(by (37), we are considering a special case in
which $1/x_{12} > 0$). This is illustrated in Fig. 11 for the slightly
different situations $m_{23} < m_{12}$ and $m_{23} > m_{12}$. If, on the
other hand, $1/x_{23} > 0$, then (38) and (39) together imply two
possible situations, depending on whether $1/x_{23} < 1/x_{12}$ or
$1/x_{23} > 1/x_{12}$. These possibilities are illustrated in Fig. 12.

If $\gamma_{313} - \gamma_{323} < 0$ (i.e., $\gamma_{313} < \gamma_{323}$, or $U_{2|3} < U_{1|3}$), we have

$$\frac{1}{b_{12}} = \frac{-\gamma_{212}}{\gamma_{313} - \gamma_{323}} > 0 \tag{40}$$

$$\frac{1}{x_{12}} = \frac{\gamma_{121}}{\gamma_{313} - \gamma_{323}} < 0. \tag{41}$$

Fig. 15. Example ideal observer decision rules for the case $\gamma_{313} - \gamma_{323} = 0$
(implying $1/b_{12} = \mp\infty$ and $1/x_{12} = \pm\infty$). In (a), $1/b_{13} < 0$, and the
“2-vs-3” line can lie anywhere between the two dashed lines shown (the region
between the vertical dashed and dotted lines is excluded because $1/b_{23} > 0$);
observations in the unlabeled region above this line will be decided “$\pi_2$,” and
those below this line will be decided “$\pi_3$.” In (b), $1/b_{13} > 0$, and the “2-vs-3”
line can lie anywhere in the unlabeled region; observations above this line will
be decided “$\pi_2$,” and those below this line will be decided “$\pi_3$.”

values  of  the  undetermined  decision  boundary  parameter  being 
illustrated  in  that  figure).  Specifically 


Fig. 11(a)  =>  Figs. 1(a), 4(a), 5(a)
Fig. 11(b)  =>  Fig. 4(b)
Fig. 12(a)  =>  Figs. 1(a), 3(a), 5(a)
Fig. 12(b)  =>  Figs. 1(b), 3(b), 5(a)
Fig. 13(a)  =>  Figs. 3(a), 4(a), 5(b)
Fig. 13(b)  =>  Fig. 4(b)
Fig. 14(a)  =>  Fig. 2(a)
Fig. 14(b)  =>  Fig. 2(b)
Fig. 15(a)  =>  Figs. 3(a), 4(a), 5(b)
Fig. 15(b)  =>  Figs. 2(a), 3(a), 4(b).

That is, none of the conditions derived in this appendix
are inconsistent with those derived in Section III or Appendix A.
More importantly, note the symmetry between the
corresponding equations and figures in Section III and this
appendix, if one “swaps” the labels of classes $\pi_2$ and $\pi_3$, and
additionally replaces $m_{ij}$ with $1/x_{i'j'}$, $x_{ij}$ with $1/m_{i'j'}$, and
$b_{ij}$ with $1/b_{i'j'}$ ($i' = 1$ if $i = 1$, 2 if $i = 3$, and 3 if $i = 2$;
similarly for $j$).
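The replacement rule just stated can be checked on a single line. With $\mathrm{LR}_k = p(\mathbf{x}|\pi_k)/p(\mathbf{x}|\pi_3)$ as in Section II, exchanging the labels of classes $\pi_2$ and $\pi_3$ turns the decision variables into $\mathrm{LR}_1/\mathrm{LR}_2$ and $1/\mathrm{LR}_2$. This coordinate change is an inference from that definition and is not written out in the text. The Python sketch below, with hypothetical function names and made-up numbers, applies the rule to a line and confirms that a transformed point still lies on the transformed line.

```python
from fractions import Fraction


def swap_classes_2_and_3(slope, y_intercept):
    """Apply the stated rule to the line LR_2 = slope * LR_1 + y_intercept.

    Both arguments must be nonzero.
    Returns (new_slope, new_y_intercept, new_x_intercept).
    """
    x_intercept = -y_intercept / slope
    return 1 / x_intercept, 1 / y_intercept, 1 / slope


def relabelled_point(lr1, lr2):
    """Image of (LR_1, LR_2) when the old class 2 becomes the reference class."""
    return lr1 / lr2, 1 / lr2


# Made-up line LR_2 = 2 * LR_1 - 3, which passes through (3, 3).
m, b = Fraction(2), Fraction(-3)
new_m, new_b, new_x = swap_classes_2_and_3(m, b)   # 2/3, -1/3, 1/2
u, v = relabelled_point(Fraction(3), m * 3 + b)    # (1, 1/3)
still_on_line = v == new_m * u + new_b             # True for these values
```

The same check with `lr1` and `lr2` exchanged, in place of the division, would illustrate the simpler reflection about the $y = x$ line described in Appendix A.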
ABSTRACT

We have shown in previous work that an ideal observer in a classification task with $N$ classes achieves the optimal
receiver operating characteristic (ROC) hypersurface in a Neyman-Pearson sense. That is, the hypersurface
obtained by taking one of the ideal observer's misclassification probabilities as a function of the other $N^2 - N - 1$
misclassification probabilities is never above the corresponding hypersurface obtained by any other observer.
Due to the inherent complexity of evaluating observer performance in an $N$-class classification task with $N > 2$,
some researchers have suggested a generally incomplete but more tractable evaluation in terms of a hypersurface
plotting only the $N$ “sensitivities” (the probabilities of correctly classifying observations in the various classes).
An $N$-class observer generally has up to $N^2 - N - 1$ degrees of freedom, so a given sensitivity will still vary when
the other $N - 1$ are held fixed; a well-defined hypersurface can be constructed by considering only the maximum
possible value of one sensitivity for each achievable value of the other $N - 1$. We show that optimal performance
in terms of this generally incomplete performance descriptor, in a Neyman-Pearson sense, is still achieved by
the $N$-class ideal observer. That is, the hypersurface obtained by taking the maximal value of one of the ideal
observer's correct classification probabilities as a function of the other $N - 1$ is never below the corresponding
hypersurface obtained by any other observer.

Keywords:  ROC  analysis,  three-class  classification,  ideal  observer  decision  rules 

1.  INTRODUCTION 

We are attempting to extend the well-known observer performance evaluation methodology of receiver operating
characteristic (ROC) analysis [1, 2] to classification tasks with three classes. This could conceivably be of benefit,
for  example,  in  a  medical  decision-making  task  in  which  a  region  of  a  patient  image  must  be  characterized  as 
containing  a  malignant  lesion,  a  benign  lesion,  or  only  normal  tissue. 

Unfortunately, a fully general but tractable extension of ROC analysis has yet to be developed. It is known
that the performance of an observer in a classification task with $N$ classes ($N > 2$) can be completely described
by a set of $N^2 - N$ conditional error probabilities [4, 5], and that the performance of the ideal observer (that
which minimizes Bayes risk [4]) is completely characterized by an ROC hypersurface in which these conditional
error probabilities depend on a set of $N^2 - N - 1$ decision criteria [5]. Although analytic expressions for the ideal
observer’s conditional error probabilities given reasonable models for the underlying observational data have
been worked out in the two-class case [6], this has not yet been accomplished in a fully general manner for tasks
with three or more classes. Furthermore, we have shown that an obvious generalization of the area under the
ROC curve (AUC) does not in fact yield a useful performance metric in tasks with three or more classes [7]. More
recently, we showed that complicated constraining relationships exist among the decision criteria themselves for
the ideal observer [8]. These constraining relationships appear to imply that it is highly unlikely that analytical
expressions for the conditional error probabilities in terms of the decision criteria can be developed which are as
simple to interpret as those for the two-class task [6].

Despite the difficulties just described, the potential benefits to be gained from a practical performance
evaluation methodology for classification tasks with three classes have motivated a number of research groups to
propose such methods. These practical methods reduce the number of degrees of freedom required to describe
the observer’s performance, either by implicitly leaving the remaining degrees of freedom out of the analysis, or
by explicitly imposing restrictions on the form of the observer’s decision rule or on the set of decision criteria
used by the observer.

Scurfield evaluated an observer which used a specified decision rule with only two degrees of freedom (as
opposed to the five decision criteria used by the general three-class ideal observer) by plotting a set of six
(two-dimensional) surfaces in three-dimensional ROC spaces [9]. Mossman proposed plotting the surface formed
only from the set of three “sensitivities” (conditional probabilities of correctly classifying observations) for an
observer with two degrees of freedom, and applied this method to an observer with a specified decision rule [10].
Chan et al. began with an ideal observer model, and reduced the number of decision criteria from five to two by
imposing explicit assumptions on the observer’s decision utilities; the observer’s performance was then plotted
as a surface in a three-dimensional ROC space, the axes of which are the probabilities of deciding an observation
to be malignant conditional on each of the three actual class memberships [11]. He et al. investigated an ideal
observer model in which the decision rule is restricted to a form similar to that proposed by Scurfield; the nature
of the restrictions is such that performance evaluation in terms of only the three sensitivities provides a complete
description of this observer’s performance [12].

A common theme among these remarkably diverse methods is the idea of an “ROC surface,” i.e., a surface
with two degrees of freedom in a three-dimensional ROC space. An appealing feature of such a construct is
its visualizability: it can be plotted as readily as any elevation map, for example, in stark contrast to the fully
general three-class classification task involving a hypersurface with five degrees of freedom in a six-dimensional
ROC space as mentioned above. While it is true that not all of the proposed methods described in the preceding
paragraph involve a “sensitivity” ROC surface, the general division of an $N$-class observer’s conditional decision
probabilities into a set of $N$ sensitivities and a set of $N^2 - N$ misclassification rates [5] makes this particular
construct a natural candidate for further analysis.

On the other hand, it can be argued that measurement of performance in terms of only $N$ conditional
classification rates must be an incomplete description of observer performance in a classification task with
more than two classes, which requires $N^2 - N$ such classification rates as stated above. Acknowledging this
incompleteness, we would like to ask whether there is any sense in which such an incomplete performance metric
is at least well-defined. In particular, is there any observer decision rule, dependent on only $N - 1$ (rather
than $N^2 - N - 1$) decision criteria, for which the observer’s sensitivity ROC hypersurface is always above the
corresponding hypersurface obtained for any other observer? If so, what form does this decision rule take?

In the next section, we show that the three-class observer which optimizes performance only in terms of the
sensitivity surface is in fact the three-class ideal observer, with its decision utilities constrained in a particular
way (reducing its degrees of freedom from five to two as necessary). Additionally, the form of the constraints
on the ideal observer’s behavior is identical to that considered by He et al. [12]. In Sec. 3, we extend this result
to the general case of an $N$-class observer, showing that the observer which attains the optimal sensitivity
hypersurface is a restricted form of the $N$-class ideal observer, and in particular a straightforward generalization
of the three-class observer considered by He et al. [12] to $N$ classes. Our conclusions are stated in Sec. 4.

2.  THREE-CLASS  OBSERVERS 

We have shown [5] that the $N$-class ideal observer (that observer which minimizes Bayes risk) also achieves
optimal performance in an ROC sense, by virtue of satisfying the Neyman-Pearson criterion. This was the same
argument used by Van Trees [4] to show that the two-class ideal observer achieves the optimal ROC curve for
a given two-class classification task. This technique of satisfying the Neyman-Pearson criterion, essentially an
application of an integral form of the method of Lagrange multipliers [13], is straightforward (conceptually, if not
notationally) and flexible, and we apply it in this section to answer the question of what observer optimizes
performance in terms of only the three observer sensitivities.

We denote by $P_{ij}$ the conditional probability of a given observer deciding an observation is drawn from the
$i$th class, conditional on it actually being drawn from the $j$th class. Thus, the three sensitivities are $P_{11}$, $P_{22}$,
and $P_{33}$. Decisions are assumed to be made based on statistically variable observational data; in particular,

$$P_{ij} = \int_{Z_i} p(\vec{x} \mid \pi_j) \, d^m\vec{x}, \tag{1}$$

where $Z_i$ is the region for which observations $\vec{x}$ (of dimension $m$) are decided to belong to the class labeled $\pi_i$
($1 \le i \le 3$).

Without loss of generality, we seek to maximize $P_{33}$ subject to the constraints $P_{11} = \alpha_{11}$ and $P_{22} = \alpha_{22}$,
where $0 < \alpha_{11} < 1$ and $0 < \alpha_{22} < 1$. We define the function

$$F = P_{33} + \lambda_{11}(P_{11} - \alpha_{11}) + \lambda_{22}(P_{22} - \alpha_{22}), \tag{2}$$

where $\lambda_{11}$ and $\lambda_{22}$ are the so-called Lagrange multipliers. Note that if we can find a decision rule (a partitioning
of the domain of $\vec{x}$ into $Z_1$, $Z_2$, and $Z_3$) that maximizes $F$ for arbitrary values of $\lambda_{11}$ and $\lambda_{22}$, then this will
be equivalent to maximizing $P_{33}$ at the point at which the constraint equations are satisfied (i.e., at the point
$P_{11} = \alpha_{11}$, $P_{22} = \alpha_{22}$).
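
The step from maximizing $F$ to maximizing $P_{33}$ under the constraints can be spelled out briefly. The following is a sketch only: the primed quantities $P'_{ii}$ and $F'$, denoting the sensitivities and the value of $F$ for an arbitrary competing observer, are notation introduced here for illustration, with $\lambda_{11}$, $\lambda_{22}$ the Lagrange multipliers and $\alpha_{11}$, $\alpha_{22}$ the constrained sensitivity values.

```latex
% Suppose a decision rule with sensitivities (P_{11}, P_{22}, P_{33}) maximizes F
% for given \lambda_{11}, \lambda_{22}, and satisfies P_{11} = \alpha_{11},
% P_{22} = \alpha_{22}. Any competing rule that meets the same two constraints
% has vanishing constraint terms, so that
\begin{align*}
P'_{33} &= P'_{33} + \lambda_{11}(P'_{11} - \alpha_{11}) + \lambda_{22}(P'_{22} - \alpha_{22}) = F' \\
        &\le F = P_{33} + \lambda_{11}(P_{11} - \alpha_{11}) + \lambda_{22}(P_{22} - \alpha_{22}) = P_{33}.
\end{align*}
% Hence no observer with the same (P_{11}, P_{22}) attains a larger P_{33}.
```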

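We first rewrite $F$ by applying rules for conditional probabilities:

$$\begin{aligned}
F &= -\lambda_{11}\alpha_{11} - \lambda_{22}\alpha_{22} + (1 - P_{13} - P_{23}) + \lambda_{11}(1 - P_{21} - P_{31}) + \lambda_{22}(1 - P_{12} - P_{32}) \\
  &= 1 + \lambda_{11}(1 - \alpha_{11}) + \lambda_{22}(1 - \alpha_{22}) - \left\{ \lambda_{22}P_{12} + P_{13} + \lambda_{11}P_{21} + P_{23} + \lambda_{11}P_{31} + \lambda_{22}P_{32} \right\} \\
  &= 1 + \lambda_{11}(1 - \alpha_{11}) + \lambda_{22}(1 - \alpha_{22}) - \left\{ \int_{Z_1} \left[ \lambda_{22}\, p(\vec{x} \mid \pi_2) + p(\vec{x} \mid \pi_3) \right] d^m\vec{x} \right. \\
  &\qquad \left. + \int_{Z_2} \left[ \lambda_{11}\, p(\vec{x} \mid \pi_1) + p(\vec{x} \mid \pi_3) \right] d^m\vec{x} + \int_{Z_3} \left[ \lambda_{11}\, p(\vec{x} \mid \pi_1) + \lambda_{22}\, p(\vec{x} \mid \pi_2) \right] d^m\vec{x} \right\}.
\end{aligned} \tag{3}$$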

For a given set of values of the parameters $\lambda_{11}$ and $\lambda_{22}$, $F$ is maximized when the quantity in braces is minimized.
This quantity, in turn, can be minimized by assigning a given $\vec{x}$ to the region $Z_i$ such that the $i$th integrand
(from among the integrals in braces in Eq. 3) is minimized. (Situations in which two or more of the integrands
yield the same minimal value for a given $\vec{x}$ can be decided in an arbitrary but consistent fashion.)

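That is,

\begin{align}
\text{decide } \pi_1 &\text{ iff } \lambda_{22}\, p(\vec{x} \mid \pi_2) < \lambda_{11}\, p(\vec{x} \mid \pi_1) \text{ and } p(\vec{x} \mid \pi_3) < \lambda_{11}\, p(\vec{x} \mid \pi_1) \tag{4} \\
\text{decide } \pi_2 &\text{ iff } \lambda_{11}\, p(\vec{x} \mid \pi_1) < \lambda_{22}\, p(\vec{x} \mid \pi_2) \text{ and } p(\vec{x} \mid \pi_3) < \lambda_{22}\, p(\vec{x} \mid \pi_2) \tag{5} \\
\text{decide } \pi_3 &\text{ iff } \lambda_{11}\, p(\vec{x} \mid \pi_1) < p(\vec{x} \mid \pi_3) \text{ and } \lambda_{22}\, p(\vec{x} \mid \pi_2) < p(\vec{x} \mid \pi_3). \tag{6}
\end{align}

We can divide these relations by $p(\vec{x} \mid \pi_3)$ to obtain

\begin{align}
\text{decide } \pi_1 &\text{ iff } \lambda_{11}\mathrm{LR}_1 - \lambda_{22}\mathrm{LR}_2 > 0 \text{ and } \lambda_{11}\mathrm{LR}_1 > 1 \tag{7} \\
\text{decide } \pi_2 &\text{ iff } \lambda_{11}\mathrm{LR}_1 - \lambda_{22}\mathrm{LR}_2 < 0 \text{ and } \lambda_{22}\mathrm{LR}_2 > 1 \tag{8} \\
\text{decide } \pi_3 &\text{ iff } \lambda_{11}\mathrm{LR}_1 < 1 \text{ and } \lambda_{22}\mathrm{LR}_2 < 1, \tag{9}
\end{align}

where $\mathrm{LR}_i = p(\vec{x} \mid \pi_i)/p(\vec{x} \mid \pi_3)$ are the likelihood ratio decision variables used by the ideal observer [4, 5]. The decision
boundary lines which partition the $(\mathrm{LR}_1, \mathrm{LR}_2)$ decision plane into the regions $Z_1$, $Z_2$, and $Z_3$ are thus

\begin{align}
\lambda_{11}\mathrm{LR}_1 - \lambda_{22}\mathrm{LR}_2 &= 0 \tag{10} \\
\lambda_{11}\mathrm{LR}_1 &= 1 \tag{11} \\
\lambda_{22}\mathrm{LR}_2 &= 1. \tag{12}
\end{align}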

Note that Eq. 12 is just the difference between Eqs. 10 and 11. If we require $\lambda_{11}$ and $\lambda_{22}$ to be positive, the
decision rule is an ideal observer decision rule [5]. Since neither the decision variables nor the form of the decision
rule depends on the particular choices of $\alpha_{11}$ and $\alpha_{22}$, we can conclude that the three-class sensitivity ROC
surface, obtained by allowing $\lambda_{11}$ and $\lambda_{22}$ to take on all possible positive values, is optimal for the observer
defined in Eqs. 10-12, in the sense that no other observer can achieve a higher sensitivity surface (i.e., a surface
with a greater value of $P_{33}$ at a given value of $(P_{11}, P_{22})$). The optimal observer for this performance metric is
seen to be the three-class ideal observer, with its decision criteria constrained so that the line separating classes
$\pi_1$ and $\pi_3$ is vertical, the line separating classes $\pi_2$ and $\pi_3$ is horizontal, and the line separating classes $\pi_1$ and
$\pi_2$ passes through the origin with slope $\lambda_{11}/\lambda_{22}$ (and thus intersects the other two lines as required). Note that
the number of free decision criteria has been reduced from five (for the general three-class ideal observer) to two
(as expected for a surface in a three-dimensional ROC space).

Figure 1. The decision rule which is found to be optimal in the sense of maximizing the ROC surface composed of only
the observer sensitivities. The decision variables are the likelihood ratios used by the general three-class ideal observer,
and the number of decision criteria is reduced from five (for the general three-class ideal observer) to two.
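
As an illustration only, the likelihood-ratio decision rule of Eqs. 7-9 can be sketched in Python alongside the equivalent "smallest integrand" assignment described after Eq. 3. The function names and the tie-breaking order (class 1 over class 2 over class 3) are assumptions made for this sketch; the text requires only that ties be resolved in an arbitrary but consistent fashion.

```python
def decide_by_likelihood_ratios(lr1, lr2, lam11, lam22):
    """Return the decided class (1, 2, or 3) from the likelihood-ratio rule.

    lr1, lr2     : likelihood ratios p(x|pi_1)/p(x|pi_3) and p(x|pi_2)/p(x|pi_3)
    lam11, lam22 : positive Lagrange multipliers
    """
    s1 = lam11 * lr1
    s2 = lam22 * lr2
    if s1 < 1 and s2 < 1:
        return 3  # below the vertical and horizontal boundary lines
    # Otherwise the line through the origin separates classes 1 and 2.
    # Ties go to class 1 (an arbitrary but consistent choice).
    return 1 if s1 >= s2 else 2


def decide_by_integrands(p1, p2, p3, lam11, lam22):
    """Assign x to the region whose integrand in Eq. 3 is smallest.

    p1, p2, p3 are the class-conditional densities evaluated at x.
    """
    integrands = [
        lam22 * p2 + p3,          # integrand over Z_1
        lam11 * p1 + p3,          # integrand over Z_2
        lam11 * p1 + lam22 * p2,  # integrand over Z_3
    ]
    # min() returns the first minimal index, so ties favor the lower class label.
    return 1 + min(range(3), key=lambda k: integrands[k])
```

With this tie-breaking convention the two functions return the same class for any densities with p3 > 0, up to floating-point rounding very close to a decision boundary, which reflects the algebraic equivalence of Eqs. 4-6 and Eqs. 7-9.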

This decision rule is shown in Fig. 1. It is interesting to note that this observer is identical to the special case
of the ideal observer evaluated by He et al. [12], which we have shown [14, 15] to be a special case of the decision rule
proposed by Scurfield [9].
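
To make the idea of tracing the sensitivity surface concrete, the following is a minimal Monte Carlo sketch, not the authors' procedure. It assumes, purely for illustration, scalar observations with unit-variance Gaussian class-conditional densities; the class means, sample sizes, seed, and all function names are hypothetical. Decisions use the density form of the rule (Eqs. 4-6), with ties resolved in favor of the lower class label.

```python
import math
import random


def gaussian_pdf(x, mean):
    """Unit-variance normal density (an assumed class model, not from the text)."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def decide(x, means, lam11, lam22):
    """Density form of the rule (Eqs. 4-6) for a scalar observation x."""
    p1, p2, p3 = (gaussian_pdf(x, m) for m in means)
    s1, s2 = lam11 * p1, lam22 * p2
    if s1 < p3 and s2 < p3:
        return 3
    return 1 if s1 >= s2 else 2


def draw_samples(means, n_per_class, seed=0):
    """Draw n_per_class observations from each of the three assumed classes."""
    rng = random.Random(seed)
    return [[rng.gauss(m, 1.0) for _ in range(n_per_class)] for m in means]


def estimate_sensitivities(samples, means, lam11, lam22):
    """Estimate (P11, P22, P33) as the fraction of correct decisions per class."""
    estimates = []
    for true_class, xs in enumerate(samples, start=1):
        correct = sum(decide(x, means, lam11, lam22) == true_class for x in xs)
        estimates.append(correct / len(xs))
    return tuple(estimates)


if __name__ == "__main__":
    means = (-2.0, 0.0, 2.0)  # hypothetical means for pi_1, pi_2, pi_3
    samples = draw_samples(means, 2000)
    # Each pair of positive multipliers yields one point on the estimated surface.
    for lam11 in (0.5, 1.0, 2.0):
        for lam22 in (0.5, 1.0, 2.0):
            print(lam11, lam22, estimate_sensitivities(samples, means, lam11, lam22))
```

Reusing the same samples for every pair of multipliers keeps the estimated points comparable: raising one multiplier can only enlarge the corresponding decision region, so the matching sensitivity estimate cannot decrease.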


3. N-CLASS OBSERVERS

The results of the preceding section can be generalized to tasks with $N$ classes for any $N > 2$. We now have
a set of $N^2$ conditional classification probabilities $P_{ij}$, with $N$ sensitivities $P_{ii}$. Equation 1 remains unchanged,
except that there are of course now $N$ regions $Z_i$ into which the domain of $\vec{x}$ is partitioned (i.e., classes into
which the observations are classified), and the observations are drawn from $N$ distributions of the form $p(\vec{x} \mid \pi_j)$.

Without loss of generality, we seek to maximize $P_{NN}$ subject to the constraints $P_{ii} = \alpha_{ii}$ for $1 \le i \le N - 1$,
where $0 < \alpha_{ii} < 1$. We define the function

$$F = P_{NN} + \sum_{i=1}^{N-1} \lambda_{ii}(P_{ii} - \alpha_{ii}), \tag{13}$$

where the $\lambda_{ii}$ are the Lagrange multipliers. Note that if we can find a decision rule (a partitioning of the
domain of $\vec{x}$ into $Z_i$ $\{1 \le i \le N\}$) that maximizes $F$ for arbitrary values of the $\lambda_{ii}$, then this will be equivalent
to maximizing $P_{NN}$ at the point at which the constraint equations are satisfied (i.e., at the point $P_{ii} = \alpha_{ii}$
$\{1 \le i \le N - 1\}$).

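As in the preceding section, we rewrite $F$ by applying rules for conditional probabilities to obtain:

$$\begin{aligned}
F &= -\sum_{i=1}^{N-1} \lambda_{ii}\alpha_{ii} + \left( 1 - \sum_{i=1}^{N-1} P_{iN} \right) + \sum_{i=1}^{N-1} \lambda_{ii} \left( 1 - \sum_{\substack{j=1 \\ j \ne i}}^{N} P_{ji} \right) \\
  &= 1 + \sum_{i=1}^{N-1} \lambda_{ii}(1 - \alpha_{ii}) - \left\{ \sum_{i=1}^{N-1} \left[ \sum_{\substack{j=1 \\ j \ne i}}^{N-1} \lambda_{jj}P_{ij} + P_{iN} \right] + \sum_{j=1}^{N-1} \lambda_{jj}P_{Nj} \right\} \\
  &= 1 + \sum_{i=1}^{N-1} \lambda_{ii}(1 - \alpha_{ii}) - \left\{ \sum_{i=1}^{N-1} \int_{Z_i} \left[ \sum_{\substack{j=1 \\ j \ne i}}^{N-1} \lambda_{jj}\, p(\vec{x} \mid \pi_j) + p(\vec{x} \mid \pi_N) \right] d^m\vec{x} + \int_{Z_N} \sum_{j=1}^{N-1} \lambda_{jj}\, p(\vec{x} \mid \pi_j) \, d^m\vec{x} \right\}.
\end{aligned} \tag{14}$$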

For a given set of values of the parameters $\lambda_{ii}$ $\{1 \le i \le N - 1\}$, $F$ is maximized when the quantity in braces
is minimized. This quantity, in turn, can be minimized by assigning a given $\vec{x}$ to the region $Z_i$ such that
the $i$th integrand (from among the integrals in braces in Eq. 14) is minimized.
(Situations in which two or more of the integrands yield the same minimal value for a given $\vec{x}$ can be decided in
an arbitrary but consistent fashion.)

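That is,

\begin{align}
\text{decide } \pi_i\ \{i < N\} \quad &\text{iff } \lambda_{jj}\, p(\vec{x} \mid \pi_j) < \lambda_{ii}\, p(\vec{x} \mid \pi_i) \quad \{i < j < N\} \notag \\
&\text{and } p(\vec{x} \mid \pi_N) < \lambda_{ii}\, p(\vec{x} \mid \pi_i) \notag \\
&\text{and } \lambda_{jj}\, p(\vec{x} \mid \pi_j) < \lambda_{ii}\, p(\vec{x} \mid \pi_i) \quad \{j < i < N\} \tag{15} \\
\text{decide } \pi_N \quad &\text{iff } \lambda_{jj}\, p(\vec{x} \mid \pi_j) < p(\vec{x} \mid \pi_N) \quad \{j < N\}. \tag{16}
\end{align}

We can divide these relations by $p(\vec{x} \mid \pi_N)$ to obtain

\begin{align}
\text{decide } \pi_i\ \{i < N\} \quad &\text{iff } \lambda_{ii}\mathrm{LR}_i - \lambda_{jj}\mathrm{LR}_j > 0 \quad \{i < j < N\} \notag \\
&\text{and } \lambda_{ii}\mathrm{LR}_i > 1 \notag \\
&\text{and } \lambda_{jj}\mathrm{LR}_j - \lambda_{ii}\mathrm{LR}_i < 0 \quad \{j < i < N\} \tag{17} \\
\text{decide } \pi_N \quad &\text{iff } \lambda_{jj}\mathrm{LR}_j < 1 \quad \{j < N\}, \tag{18}
\end{align}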

where $\mathrm{LR}_i = p(\vec{x} \mid \pi_i)/p(\vec{x} \mid \pi_N)$ are the likelihood ratio decision variables used by the ideal observer [4, 5]. The
decision boundary hyperplanes which partition the $\mathbf{LR} = (\mathrm{LR}_1, \ldots, \mathrm{LR}_{N-1})$ decision space into the regions $Z_i$
are thus

\begin{align}
\lambda_{ii}\mathrm{LR}_i - \lambda_{jj}\mathrm{LR}_j &= 0 \quad \{i < j < N\} \tag{19} \\
\lambda_{ii}\mathrm{LR}_i &= 1 \quad \{i < N\}. \tag{20}
\end{align}

Note that any of these equations, for example that defining part of the boundary between classes $\pi_j$ and $\pi_k$, can
be expressed as the difference of two other such equations (in this example, those defining boundaries between
classes $\pi_i$ and $\pi_j$, and between classes $\pi_i$ and $\pi_k$). If we require the $\lambda_{ii}$ to be positive, the resulting decision rule
is an ideal observer decision rule [5]. Since neither the decision variables nor the form of the decision rule depends
on the particular choices of $\alpha_{ii}$, we can conclude that the $N$-class sensitivity ROC hypersurface, obtained by
allowing the $\lambda_{ii}$ to take on all possible positive values, is optimal for the observer defined in Eqs. 19 and 20, in
the sense that no other observer can achieve a higher sensitivity hypersurface (i.e., one with a greater value of
$P_{NN}$ at a given value of $(P_{11}, \ldots, P_{(N-1)(N-1)})$). The optimal observer for this performance metric is seen to
be the $N$-class ideal observer, with its decision criteria constrained so that the boundary separating classes $\pi_i$
and $\pi_N$ is a hyperplane defined by $\mathrm{LR}_i = 1/\lambda_{ii}$, while the boundary separating classes $\pi_i$ and $\pi_j$ is a hyperplane
defined by $\lambda_{ii}\mathrm{LR}_i = \lambda_{jj}\mathrm{LR}_j$.

Although an intuitive geometric understanding of this decision rule is more elusive than in the three-class
case, it is at least evident that the boundaries intersect as expected; that is, the boundary separating classes
$\pi_i$ and $\pi_j$ intersects the boundary separating classes $\pi_i$ and $\pi_k$, and also intersects the boundary separating
classes $\pi_j$ and $\pi_k$. Note also that the number of free decision criteria has been reduced from $N^2 - N - 1$ (for
the general $N$-class ideal observer) to $N - 1$ (as expected for a hypersurface in an $N$-dimensional ROC space).
More importantly, comparison of Eqs. 19 and 20 with Eqs. 10-12 reveals this $N$-class observer to be an obvious
extension from three to $N$ classes of the observer described in the preceding section.
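
One way to read the N-class rule is as a comparison of the weighted likelihood ratios against each other and against 1, which can be implemented as a single arg-max. The sketch below is illustrative only: the function names are hypothetical, and ties are resolved in favor of the lower class label (one arbitrary but consistent choice). The pairwise version follows the boundary conditions of Eqs. 19 and 20 directly and is included to check the arg-max shortcut.

```python
def decide_n_class(lrs, lams):
    """Decide among N classes from the N - 1 weighted likelihood ratios.

    lrs  : [LR_1, ..., LR_{N-1}], likelihood ratios relative to class N
    lams : [lambda_11, ..., lambda_(N-1)(N-1)], positive multipliers
    Returns a class label in 1..N.
    """
    if len(lrs) != len(lams):
        raise ValueError("need one multiplier per likelihood ratio")
    scores = [lam * lr for lam, lr in zip(lams, lrs)]
    scores.append(1.0)  # class N competes with a fixed score of 1
    # max() returns the first maximal index, so ties favor the lower label.
    return 1 + max(range(len(scores)), key=lambda k: scores[k])


def decide_n_class_pairwise(lrs, lams):
    """Same decision, written as explicit pairwise and threshold comparisons."""
    if len(lrs) != len(lams):
        raise ValueError("need one multiplier per likelihood ratio")
    n = len(lrs) + 1
    s = [lam * lr for lam, lr in zip(lams, lrs)]
    if all(sj < 1 for sj in s):
        return n  # every weighted ratio lies below its threshold hyperplane
    for i in range(n - 1):
        others = (s[j] for j in range(n - 1) if j != i)
        if s[i] >= 1 and all(s[i] >= sj for sj in others):
            return i + 1
    raise ValueError("scores must be comparable numbers")
```

For N = 3 this reduces to the three-class rule of the preceding section: `decide_n_class([lr1, lr2], [lam11, lam22])` returns class 3 only when both weighted ratios are below 1.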

4.  CONCLUSIONS 

A  fully  general  performance  evaluation  methodology  for  the  three-class  classification  task  has  yet  to  be  developed, 
a  frustrating  state  of  affairs  given  the  great  success  and  wide  application  of  ROC  analysis  to  two-class  classification 
tasks.  A  primary  reason  for  the  difficulty  in  developing  a  fully  general  extension  of  ROC  analysis  to  the  three- 
class  classification  task  is  the  rapid  increase  in  the  number  of  performance  measurement  variables  and  decision 
criteria  necessary  to  characterize  observer  (in  particular,  ideal  observer)  performance.  Specifically,  the  number 
of sensitivities or misclassification rates needed increases from two to six (and to $N^2 - N$ in the general case),
while the number of decision criteria increases from a single decision variable threshold to a set of five mutually
constrained [8] criteria (and to $N^2 - N - 1$ in the general case). In short, the complexity of the problem increases
not  linearly  with  the  number  of  classes,  but  quadratically. 
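
The counts quoted above follow directly from the formulas in the text, as the short sketch below tabulates. The function name and dictionary keys are hypothetical labels chosen for this illustration.

```python
def roc_dimensions(n_classes):
    """Tabulate the counts discussed in the text for an N-class task."""
    if n_classes < 2:
        raise ValueError("a classification task needs at least two classes")
    n = n_classes
    return {
        "sensitivities": n,
        "conditional_error_probabilities": n * n - n,      # N^2 - N
        "ideal_observer_decision_criteria": n * n - n - 1,  # N^2 - N - 1
        "sensitivity_only_decision_criteria": n - 1,        # restricted observer
    }


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        print(n, roc_dimensions(n))
```

For N = 3 this gives three sensitivities, six conditional error probabilities, five decision criteria for the general ideal observer, and two for the restricted observer, matching the figures used throughout.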

The  motivation  for  the  numerous  proposed  methods,  outlined  in  Sec.  1,  for  evaluating  the  performance  of 
a  three-class  classifier  in  terms  of  two-dimensional  surfaces  in  three-dimensional  ROC  spaces  (rather  than  the 
five-dimensional  hypersurfaces  in  six-dimensional  ROC  spaces  required  by  the  theory)  is  thus  quite  clear.  We 
currently  lack  a  theoretical  framework  with  which  to  judge  the  appropriateness  of  any  of  the  proposed  methods 
to  any  particular  classification  task.  However,  even  if  one  chooses  to  adopt  a  performance  evaluation  metric 
known  to  provide  an  incomplete  description  of  observer  performance,  it  is  still  reasonable  to  ask  what  observer, 
if  any,  will  achieve  optimal  performance  with  respect  to  that  metric. 

We  have  addressed  that  question  in  regard  to  measurement  of  an  observer’s  performance  in  terms  of  only 
its sensitivities (the probabilities of correctly classifying the three, or in general $N$, classes of observations).
Theoretically, this is clearly an incomplete measure of performance (another set of three, or in general $N^2 - 2N$,
misclassification rates is necessary). Conceding this point, we consider it a nontrivial observation, derived in
the preceding sections, that the observer which optimizes this limited performance metric is not one unrelated
to the general ideal observer, nor an arcane special case of the ideal observer, but a special case of the ideal
observer which is in a subjective sense quite simple, and which has been independently evaluated from very
different perspectives by other researchers [9-12]. We find these results at once reassuring and encouraging, and
hope  that  research  into  this  thorny  problem  will  continue  to  bear  unexpected  fruit. 
Optimization of restricted ROC surfaces in
three-class  classification  tasks 

Darrin  C.  Edwards*  and  Charles  E.  Metz 


Abstract: We have shown previously that an $N$-class ideal
observer  achieves  the  optimal  receiver  operating  characteristic 
(ROC)  hypersurface  in  a  Neyman-Pearson  sense.  Due  to  the 
inherent  complexity  of  evaluating  observer  performance  even  in 
a  three-class  classification  task,  some  researchers  have  suggested 
a  generally  incomplete  but  more  tractable  evaluation  in  terms  of 
a  surface  plotting  only  the  three  “sensitivities.”  More  generally, 
one  can  evaluate  observer  performance  with  a  single  sensitivity 
or  misclassification  probability  as  a  function  of  two  linear 
combinations  of  sensitivities  or  misclassification  probabilities. 
We  analyzed  four  such  formulations  including  the  “sensitivity” 
surface.  In  each  case,  we  applied  the  Neyman-Pearson  criterion 
to  find  the  observer  which  achieves  optimal  performance  with 
respect  to  each  given  set  of  “performance  description  variables” 
under  consideration.  In  the  unrestricted  case,  optimization  with 
respect  to  the  Neyman-Pearson  criterion  yields  the  ideal  observer, 
as  does  maximization  of  the  observer's  expected  utility.  Moreover, 
during  our  consideration  of  the  restricted  cases,  we  found  that 
the  two  optimization  methods  do  not  merely  yield  the  same 
observer,  but  are  in  fact  completely  equivalent  in  a  mathematical 
sense.  Thus,  for  a  wide  variety  of  observers  which  maximize 
performance  with  respect  to  a  restricted  ROC  surface  in  the 
Neyman-Pearson  sense,  that  ROC  surface  can  also  be  shown  to 
provide  a  complete  description  of  the  observer’s  performance  in 
an  expected-utility  sense. 

Index  Terms — ROC  analysis,  three-class  classification,  ideal 
observer  decision  rules,  Neyman-Pearson  criterion,  expected 
utility  maximization 

I.  Introduction 

We are attempting to extend the well-known observer performance evaluation methodology of
receiver operating characteristic (ROC) analysis [1], [2] to classification tasks with three
classes. This could conceivably be of benefit, for example, in a medical decision-making task
in which a region of a patient image must be characterized as containing a malignant lesion, a
benign lesion, or only normal tissue [3].

Unfortunately, a fully general extension of ROC analysis to classification tasks with more than
two classes has yet to be developed. It is known that the performance of an observer in a
classification task with $N$ classes ($N > 2$) can be completely described by a set of $N^2 - N$
conditional error probabilities [4], [5], and that the performance of the ideal observer (that
which minimizes Bayes risk [4]) is completely characterized by an ROC hypersurface in which
these conditional error probabilities depend on a set of $N^2 - N - 1$ decision criteria [5].
Although analytic expressions for the ideal observer’s conditional error probabilities given
reasonable models for the underlying observational data have been worked out in the two-class
case [6], this has not yet been accomplished in a fully general manner for tasks with three or
more classes. Furthermore, we have shown that an obvious generalization of the area under the
ROC curve (AUC) does not in fact yield a useful performance metric in tasks with three or more
classes [7]. More recently, we showed that complicated constraining relationships exist among
the decision criteria themselves for the ideal observer [8]. These constraining relationships
appear to imply that it is highly unlikely that analytical expressions for the conditional error
probabilities in terms of the decision criteria can be developed which are as simple to interpret
as those for the two-class task [6].

Despite the difficulties just described, the potential benefits to be gained from a practical
performance evaluation methodology for classification tasks with three classes have motivated a
number of research groups to propose such methods. These practical methods reduce the number of
degrees of freedom used to describe the observer’s performance, either by implicitly leaving the
remaining degrees of freedom out of the analysis, or by explicitly imposing restrictions on the
form of the observer’s decision rule or on the set of decision criteria used by the observer. In
this work, we are concerned specifically with the latter case, and we will refer to such a model
as a “restricted” performance evaluation methodology.

Scurfield evaluated an observer which used a specified decision rule with only two degrees of
freedom (in general a three-class observer can have up to five degrees of freedom) by plotting a
set of six (two-dimensional) surfaces in three-dimensional ROC spaces [9]. Mossman proposed
plotting the surface formed only from the set of three “sensitivities” (conditional probabilities
of correctly classifying observations) for an observer with two degrees of freedom, and applied
this method to an observer with a specified decision rule [10]. Chan et al. began with an ideal
observer model, and reduced the number of decision criteria from five to two by imposing explicit
assumptions on the observer’s decision utilities. The observer’s performance was then plotted as
a surface in a three-dimensional ROC space, the axes of which are the three conditional
probabilities of deciding an observation to be malignant (this description of performance was
also shown to be complete) [11]. He et al. investigated a special case of the ideal observer
model which is also a special case of the decision rule proposed by Scurfield; they showed that
due to the assumptions of their model, performance evaluation in terms of only the three
sensitivities provides a complete description of this observer’s performance [12]. Recently we
investigated the relationships between each of these proposed decision rules and the decision
rule used by the three-class ideal observer [13]; that work, however, was limited to theoretical
aspects of the decision rules themselves, and did not take into account the important issue of
performance measurement. The present work attempts to address this issue; it continues our
analysis of the proposed decision strategies described above, specifically from the point of view
of ROC analysis.

A common theme among these remarkably diverse proposed
decision strategies is the idea of an “ROC surface,” i.e., a
surface  with  two  degrees  of  freedom  in  a  three-dimensional 
ROC  space.  An  appealing  feature  of  such  a  construct  is  its 
visualizability;  it  can  be  plotted  as  readily  as  any  elevation 
map,  for  example,  in  stark  contrast  to  the  fully  general 
three-class  classification  task  involving  a  hypersurface  with 
five  degrees  of  freedom  in  a  six-dimensional  ROC  space  as 
mentioned  above. 

On  the  other  hand,  it  can  be  argued  that  measurement 
of  three-class  classification  performance  in  terms  of  only 
three  conditional  classification  rates  may  yield  an  incomplete 
description  of  observer  performance;  for  example,  a  complete 
description  of  the  unrestricted  three-class  ideal  observer’s 
performance  requires  six  such  conditional  classification  rates, 
as  stated  above.  Acknowledging  this  possible  incompleteness, 
we  would  like  to  ask  whether  there  is  any  sense  in  which 
such  a  restricted  performance  evaluation  method  is  at  least 
well-defined. In particular, suppose we elect to measure performance
in terms of an ROC surface given by a single linear
combination  of  either  sensitivities  or  of  conditional  error  rates 
as  a  function  of  two  different  linear  combinations  of  other 
conditional  classification  rates.  We  then  ask,  is  there  any 
observer  decision  rule,  dependent  on  only  two  (rather  than  five) 
decision  criteria,  for  which  the  specified  ROC  surface  is  never 
below  (when  the  surface’s  dependent  variable  is  a  sensitivity) 
or  never  above  (when  the  surface’s  dependent  variable  is  a 
conditional  error  rate)  the  corresponding  surface  obtained  for 
any  other  observer?  If  so,  what  form  does  this  decision  rule 
take? 

In  attempting  to  answer  this  question  for  the  special  cases 
listed  above,  as  well  as  for  closely  related  models,  we  applied 
the  Neyman-Pearson  criterion  to  find  the  observer  which 
achieves  optimal  performance  with  respect  to  each  given  set 
of  “performance  description  variables”  (the  particular  set  of 
three  linear  combinations  of  sensitivities  or  conditional  error 
rates  under  consideration).  In  the  unrestricted  case,  it  is  well 
known  for  N  =  2  [4],  and  we  showed  recently  for  N  > 
2  [5],  that  optimization  with  respect  to  the  Neyman-Pearson 
criterion  yields  the  same  observer  as  does  maximization  of 
the  observer’s  expected  utility  (or,  equivalently,  minimization 
of  Bayes’s  risk):  namely,  the  ideal  observer.  During  our 
consideration  of  the  restricted  cases,  we  found  that  the  two 
optimization  methods  do  not  merely  yield  the  same  observer, 
but  are  in  fact  completely  equivalent  in  a  mathematical  sense. 

The  proof  of  this  equivalence,  in  the  unrestricted  case,  is 
given in Sec. II. In Sec. III, we show that the equivalence holds
true even in a “restricted case” such as those just mentioned:
specifically, when a linear constraint is applied to the utilities
used by the ideal observer to make decisions, thereby reducing
the  number  of  performance  description  variables  required  to 
describe  the  performance  of  the  resulting  observer.  We  then 
analyze  four  different  observer  decision  strategies  proposed 
recently  in  the  literature  (and  known  to  be  special  cases  of  the 
three-class  ideal  observer  [13])  in  light  of  this  result.  (It  should 
be  noted  that,  although  the  restricted  cases  we  consider  are 
all,  in  fact,  special  cases  which  are  in  some  sense  “derivable” 
from  the  unrestricted  model  we  considered  previously  [5], 
the  derivations  in  that  previous  work  did  not  consider  the 
possibility  of  introducing  any  such  constraints  on  the  decision 
process.) For the reader’s convenience, much of the mathematical
detail of this analysis is relegated to corresponding
appendices.  Finally,  these  results  are  summarized,  and  our 
conclusions  presented,  in  Secs.  IV  and  V. 

II.  The  Equivalence  of  the  Neyman-Pearson  and 
Expected  Utility  Optimizations 

The  expected  utility  of  the  decisions  made  by  an  observer 
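in an $N$-class classification task can be expressed as [5]

$$
\begin{aligned}
E\{\mathbf{U}\} &= \sum_{i=1}^{N} \sum_{j=1}^{N} U_{i|j}\, P(\mathbf{d} = \pi_i, \mathbf{t} = \pi_j) \\
&= \sum_{i=1}^{N} \sum_{j=1}^{N} U_{i|j}\, P(\mathbf{d} = \pi_i \mid \mathbf{t} = \pi_j)\, P(\mathbf{t} = \pi_j), \qquad (1)
\end{aligned}
$$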

where the labels $\pi_1$ through $\pi_N$ identify the classes to which
observations belong; the number $U_{i|j}$ is defined as the utility
of deciding an observation belongs to class $\pi_i$ given that it
is actually drawn from class $\pi_j$; and the random variables
$\mathbf{t}$ and $\mathbf{d}$ indicate the true class to which a randomly drawn
observation belongs and the observer’s decision for classifying
that observation, respectively. (We use boldface type to denote
statistically variable quantities.) For notational simplicity, we
will write the conditional classification rate $P(\mathbf{d} = \pi_i \mid \mathbf{t} = \pi_j)$
as $P_{ij}$, and the a priori class membership probability
$P(\mathbf{t} = \pi_i)$ as $P(\pi_i)$.

For  a  three-class  classification  task,  the  expected  utility  can 
be  written  explicitly  as 

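$$
\begin{aligned}
E\{\mathbf{U}\} &= [U_{1|1}P_{11} + U_{2|1}P_{21} + U_{3|1}P_{31}]\,P(\pi_1) \\
&\quad + [U_{1|2}P_{12} + U_{2|2}P_{22} + U_{3|2}P_{32}]\,P(\pi_2) \\
&\quad + [U_{1|3}P_{13} + U_{2|3}P_{23} + U_{3|3}P_{33}]\,P(\pi_3). \qquad (2)
\end{aligned}
$$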

Note that the nine conditional classification rates $P_{ij}$ appearing
in this expression are not independent; for example, given the
definition of conditional probability, it must be the case that
$P_{11} + P_{21} + P_{31} = 1$. Thus within any pair of square brackets
in (2), one of the three conditional classification rates can be
eliminated, leaving an expression which depends in general on
six conditional classification rates.

It can readily be shown that the observer which maximizes
this expected utility is in fact the ideal observer [4], [5].
(Note that in our previous work, we demonstrated that the
observer which maximizes $E_{\mathbf{t}}\{U(\mathbf{x}, \mathbf{t}) \mid \mathbf{x}\}$ is the ideal
observer [5]; this is consistent with the present statement because
$E\{\mathbf{U}\} = E_{\mathbf{x}}\{E_{\mathbf{t}}\{U(\mathbf{x}, \mathbf{t}) \mid \mathbf{x}\}\}$, and therefore maximizing the
inner expectation value at each given value of $\mathbf{x}$ will maximize
$E\{\mathbf{U}\}$.) The three-class ideal observer makes decisions by
partitioning a likelihood ratio decision variable plane into three
regions with three intersecting lines [4], [5]. The likelihood
ratios can be taken to be $\mathrm{LR}_1 = p(\mathbf{x}|\pi_1)/p(\mathbf{x}|\pi_3)$ and
$\mathrm{LR}_2 = p(\mathbf{x}|\pi_2)/p(\mathbf{x}|\pi_3)$, ratios of the conditional probability density
functions of the observational data $\mathbf{x}$ taken as functions of that
random observational data. In the notation we advocate [8], the
equations for the three decision boundary lines are

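$$\gamma_{121}\,\mathrm{LR}_1 - \gamma_{212}\,\mathrm{LR}_2 = \gamma_{313} - \gamma_{323} \qquad (3)$$

$$\gamma_{131}\,\mathrm{LR}_1 + (\gamma_{232} - \gamma_{212})\,\mathrm{LR}_2 = \gamma_{313} \qquad (4)$$

$$(\gamma_{131} - \gamma_{121})\,\mathrm{LR}_1 + \gamma_{232}\,\mathrm{LR}_2 = \gamma_{323}, \qquad (5)$$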

which we call, respectively, the “1-vs.-2” line, the “1-vs.-3”
line, and the “2-vs.-3” line. Here $\gamma_{iji} = (U_{i|i} - U_{j|i})P(\pi_i)$;
since the utility of a correct decision can be assumed to be
greater than that of an incorrect decision, the $\gamma_{iji}$ can be
understood to be positive. (Note also that because (3)-(5) can
be multiplied by any positive constant without changing the
resulting decision boundary lines, those lines are determined
by five rather than six parameters, or $N^2 - N - 1$ rather than
$N^2 - N$ in general [5].) In this notation, the expression in (2)
can be simplified to obtain

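$$
\begin{aligned}
E\{\mathbf{U}\} &= U_{1|1}P(\pi_1) + U_{2|2}P(\pi_2) + U_{3|3}P(\pi_3) \\
&\quad - \gamma_{121}P_{21} - \gamma_{131}P_{31} \\
&\quad - \gamma_{212}P_{12} - \gamma_{232}P_{32} \\
&\quad - \gamma_{313}P_{13} - \gamma_{323}P_{23}. \qquad (6)
\end{aligned}
$$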
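The decision rule defined by the boundary lines (3)-(5) can be illustrated with a short Python sketch. It is not part of the original analysis: the function names (`gammas`, `decide_by_utility`, `decide_by_lines`) and the dictionary layout are assumptions made for illustration, and observations falling exactly on a boundary line are assigned arbitrarily. The sketch compares direct maximization of the conditional expected utility with the sign tests implied by the three lines.

```python
CLASSES = (1, 2, 3)

def gammas(U, prior):
    """Return g["jij"] = (U[j|j] - U[i|j]) * P(pi_j), i.e. gamma_{jij} in the text.

    U[(d, t)] is the utility of deciding class d when the true class is t
    (a hypothetical layout chosen for this sketch).
    """
    return {f"{j}{i}{j}": (U[(j, j)] - U[(i, j)]) * prior[j]
            for j in CLASSES for i in CLASSES if i != j}

def decide_by_utility(lr1, lr2, U, prior):
    """Choose the decision with the largest expected utility given the data."""
    lr = {1: lr1, 2: lr2, 3: 1.0}  # p(x|pi_t) / p(x|pi_3)
    score = {d: sum(U[(d, t)] * prior[t] * lr[t] for t in CLASSES)
             for d in CLASSES}
    return max(score, key=score.get)

def decide_by_lines(lr1, lr2, g):
    """Choose the decision from which side of lines (3)-(5) the point lies on."""
    one_over_two = g["121"] * lr1 - g["212"] * lr2 > g["313"] - g["323"]
    one_over_three = g["131"] * lr1 + (g["232"] - g["212"]) * lr2 > g["313"]
    two_over_three = (g["131"] - g["121"]) * lr1 + g["232"] * lr2 > g["323"]
    if one_over_two and one_over_three:
        return 1
    if not one_over_two and two_over_three:
        return 2
    return 3
```

With equal priors and utilities of 1 for correct and 0 for incorrect decisions, all six $\gamma$ values coincide and the rule reduces to choosing the class with the largest likelihood.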

An alternative method for defining “optimal performance”
is in terms of the Neyman-Pearson criterion [4], [5]; the technique
of satisfying the Neyman-Pearson criterion is essentially
an application of an integral form of the method of Lagrange
multipliers [14]. As just stated, the behavior of the ideal
observer is governed by $N^2 - N - 1$ parameters, and for the
present discussion we restrict our consideration of non-ideal
observers to those with $N^2 - N - 1$ degrees of freedom as well.
Without loss of generality, an observer’s ROC hypersurface
can be defined as $P_{N(N-1)}$ taken as a function of the other
$N^2 - N - 1$ conditional error probabilities. (The restriction
just made is then seen to be of little practical consequence: for
an observer with more than $N^2 - N - 1$ degrees of freedom,
one is free to consider only combinations of parameters such
that $P_{N(N-1)}$ is minimized for a given set of the independent
variables, reducing the number of parameters to $N^2 - N - 1$;
while for an observer with fewer than $N^2 - N - 1$ degrees of
freedom, it is simply the case that $P_{N(N-1)}$ is undefined for
particular combinations of the independent variables.)

For the three-class classification task under consideration,
the ROC hypersurface is thus given by
$P_{32} = R(P_{12}, P_{13}, P_{21}, P_{23}, P_{31})$. It is reasonable to define an
“optimal observer” as one that achieves the lowest possible value
of $P_{32}$ for a given set of values of $P_{12}$, $P_{13}$, $P_{21}$, $P_{23}$, $P_{31}$;
this condition is known as the Neyman-Pearson criterion,
and it can be shown that the observer which satisfies this
criterion is in fact the ideal observer [4], [5], i.e., the same
observer obtained by maximizing the expected utility in (2) or,
equivalently, (6). We will not reproduce the entire derivation
of that result here; it will be sufficient to outline the motivation
for the Neyman-Pearson criterion.

As stated, we seek to minimize $P_{32}$, or, equivalently,
maximize $-P_{32}$, at a particular set of values
of $P_{12}$, $P_{13}$, $P_{21}$, $P_{23}$, $P_{31}$. Following the notation of Van
Trees [4], we denote those particular values by $\alpha_{ij}$ (e.g., $\alpha_{12}$
is the particular value of $P_{12}$ under consideration). We then
construct the function

$$F = -P_{32} - \sum_{i \neq j} \lambda_{ij}(P_{ij} - \alpha_{ij}). \qquad (7)$$

(The term for $i = 3$ and $j = 2$ is to be understood as excluded
from the sum, here and throughout this section.) If $F$ can
be maximized over all values of the $P_{ij}$, and if the maximal
value does not depend on the $\alpha_{ij}$, then at the particular set
of independent variables such that $P_{ij} = \alpha_{ij}$, the terms in
the sum (the “constraints”) will vanish; the maximum in $F$ at
that point will correspond simply to a minimum of $P_{32}$ at the
particular set of independent variables in question.

Since the factors $\lambda_{ij}$ appearing in front of the constraints
(the so-called Lagrange multipliers) are in any practical sense
arbitrary, we are free to make whatever choice is convenient
(effectively, this is equivalent to choosing a “scale” for $F$ relative
to $P_{32}$ and the other conditional probabilities). Consider
the change of variables

$$\gamma_{jij} = \gamma_{232}\lambda_{ij}, \qquad (8)$$

where $\gamma_{232}$ is in turn defined as some arbitrary positive
constant (cf. the statement after (5) that the set of six values of
$\gamma_{jij}$ could be reduced to five by multiplying by any convenient
positive constant). The values of $\gamma_{232}$ and the other $\gamma_{jij}$ are
here assumed to be positive; although the $\lambda_{ij}$ are, as just
stated, effectively arbitrary, we will be able to show shortly
that this restriction to positive values does not result in a loss
of generality.

With  this  substitution,  the  Neyman-Pearson  function  can  be 
rewritten  as 

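$$
\begin{aligned}
\gamma_{232}F &= -\gamma_{232}P_{32} - \sum_{i \neq j} \gamma_{jij}(P_{ij} - \alpha_{ij}) \\
&= E\{\mathbf{U}\} - \sum_{i} U_{i|i}P(\pi_i) + \sum_{i \neq j} \gamma_{jij}\alpha_{ij}, \qquad (9)
\end{aligned}
$$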

i.e., the expression for expected utility plus constant terms
independent of the observer’s decision rule (which determines
the $P_{ij}$). In this form, the fact that maximization of expected
utility and satisfaction of the Neyman-Pearson criterion both
yield the ideal observer is seen to be not merely an elegant
convenience, but a necessary consequence of the mathematical
equivalence of the two methods. It is also worth noting that,
by replacing $\gamma_{232}$ with the more general $\gamma_{(N-1)N(N-1)}$ and
removing the implicit restrictions on $i$ and $j$ to $\{1, 2, 3\}$, the
equivalence in (9) is seen to hold for classification tasks with
an arbitrary number of classes, not just three.
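The identity in (9) can be checked numerically for any admissible set of conditional classification rates. The following Python sketch is illustrative only: the function names, the `(decision, truth)` dictionary keys, and the sample numbers in the usage note are assumptions, not values taken from the analysis.

```python
CLASSES = (1, 2, 3)

def expected_utility(U, prior, P):
    """Eq. (2): sum of U[d|t] * P[d|t] * P(pi_t) over decisions d and truths t."""
    return sum(U[(d, t)] * P[(d, t)] * prior[t]
               for d in CLASSES for t in CLASSES)

def eq9_sides(U, prior, P, alpha):
    """Return (lhs, rhs) of Eq. (9) for rates P and target values alpha."""
    # gamma[(d, t)] is gamma_{tdt}, the weight multiplying the error rate P_{dt}.
    gamma = {(d, t): (U[(t, t)] - U[(d, t)]) * prior[t]
             for d in CLASSES for t in CLASSES if d != t}
    constrained = [k for k in gamma if k != (3, 2)]  # the (3, 2) term is excluded
    lhs = -gamma[(3, 2)] * P[(3, 2)] - sum(
        gamma[k] * (P[k] - alpha[k]) for k in constrained)
    rhs = (expected_utility(U, prior, P)
           - sum(U[(t, t)] * prior[t] for t in CLASSES)
           + sum(gamma[k] * alpha[k] for k in constrained))
    return lhs, rhs
```

For rates satisfying $P_{1j} + P_{2j} + P_{3j} = 1$ for each true class $j$, the two returned values should agree to within floating-point rounding, whatever utilities, priors, and $\alpha_{ij}$ are supplied.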

We can now also justify the claim just made that the
Lagrange multipliers can be restricted to positive values
without loss of generality (assuming $\gamma_{232}$ to be an arbitrary
positive constant). In the context of expected utility, a negative
value of $\lambda_{ij}$ or, equivalently, of $\gamma_{jij}$ would correspond to an
incorrect decision having a utility greater than that of the
corresponding correct decision. This possibility (equivalent
in a two-class classification task to an ROC operating point
“below the guessing line”) can, at least in the context of the
ideal observer, be ignored as being “perverse.” (A zero value
of $\gamma_{jij}$ corresponds to an incorrect decision having a utility
exactly equal to that of the corresponding correct decision.
Although we choose to ignore this situation in the general
case, a model in which some of the $\gamma_{jij}$ are set to zero without
contradiction is considered in Sec. III-B.)

III.  Restricted  ROC  Surfaces 
A.  Theoretical  Considerations 

In  the  Introduction,  it  was  pointed  out  that  the  complexity 
of  the  three-class  classification  task  has  so  far  hindered  the 
development  of  a  fully  general  extension  of  ROC  analysis 
to  this  task.  As  a  result,  many  researchers  have  proposed 
simplified  or  restricted  performance  evaluation  strategies;  a 
number  of  these,  also  mentioned  in  the  Introduction,  involve 
the  imposition  of  linear  “constraints”  on  the  utilities  used  by 
the  ideal  observer  to  make  decisions.  (In  previous  work,  we 
examined  the  relationship  between  those  proposed  decision 
rules  and  the  decision  rule  used  by  the  ideal  observer,  without 
explicit  regard  to  performance  evaluation  issues  [13].)  The 
practical  effect  of  these  constraints,  as  will  be  shown  in 
more  detail  in  the  remainder  of  this  section,  is  to  reduce  the 
number  of  performance  description  variables  (the  sensitivities 
or  conditional  error  rates)  needed  to  describe  the  observer’s 
performance.  In  this  section,  we  will  first  demonstrate  that 
the  equivalence  between  expected  utility  maximization  and 
optimization  through  the  Neyman-Pearson  criterion  also  holds 
when  arbitrary  linear  constraints  are  placed  on  the  decision 
utilities;  this  result  is  shown  to  hold  true  for  classification  tasks 
with  an  arbitrary  number  of  classes.  We  will  then  illustrate  this 
equivalence  explicitly  by  considering  four  proposed  restricted 
three-class  models. 

Consider a simple linear constraint on the decision utilities
of the form $U_{i|j} = U_{k|l}$. In the special case $i = j = l$, we
clearly have $\gamma_{iki} = (U_{i|i} - U_{k|i})P(\pi_i) = 0$. Similarly, any
linear constraint on the utilities $U_{i|j}$ can be reexpressed as a
linear constraint on the $\gamma_{iji}$ parameters, which we can write
as

$$\gamma_{iji} = \sum_{k \neq l} \nu_{lk}\gamma_{klk}, \qquad (10)$$

where the $\nu_{lk}$ are a set of constants determining the constraint.

Substituting  (10)  into  (6)  allows  us  to  write 

$$E_\nu\{\mathbf{U}\} = \sum_{i} U_{i|i}P(\pi_i) - \{\ldots + \gamma_{klk}(P_{lk} + \nu_{lk}P_{ji}) + \ldots\}. \qquad (11)$$

(Here the subscript $\nu$ on the expectation operator denotes the
restriction imposed on the utilities via the $\nu_{lk}$, and not a
random variable over which the expectation is taken.) Note that
$\gamma_{iji}$ no longer appears in the expression for expected utility,
which now depends on a set of $N^2 - N - 1$ generalized performance
description variables (GPDVs), i.e., the expressions
$P_{lk} + \nu_{lk}P_{ji}$. In general, of course, these may not have any
obvious practical interpretation in terms of the performance of
the observer (hence the use of the word “generalized”). For
non-negative values of $\nu_{lk}$, however, it is at least the case that
a weighted sum of sensitivities will still behave in some sense
like a sensitivity (higher values for a given observer are better
than lower ones), and a weighted sum of conditional error
rates still behaves like a conditional error rate (lower values
are better than higher ones). This should be regarded as a
practical rather than theoretical consideration, and it is in some
sense an obligation of a proposed restricted method that the
actual GPDVs involved be justifiable (or at least interpretable).
This will be attempted in the remainder of this section during
consideration of the four special cases referred to above.
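As a concrete instance of (10) and (11), consider, purely for illustration, the single constraint $U_{2|1} = U_{3|1}$ in a three-class task. This is equivalent to $\gamma_{131} = \gamma_{121}$, i.e., (10) with a single nonzero coefficient equal to one; the choice of constraint and the index assignment are assumptions made for this worked example. Substituting into (6) gives:

```latex
\begin{aligned}
E_\nu\{\mathbf{U}\} &= \sum_{i=1}^{3} U_{i|i}P(\pi_i)
  - \gamma_{121}\,(P_{21} + P_{31})
  - \gamma_{212}P_{12} - \gamma_{232}P_{32}
  - \gamma_{313}P_{13} - \gamma_{323}P_{23}, \\
P_{21} + P_{31} &= 1 - P_{11}.
\end{aligned}
```

The expected utility then depends on five GPDVs rather than six, and the combined variable $P_{21} + P_{31}$ is simply one minus the sensitivity for class $\pi_1$.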

For the moment, note that if we construct a Neyman-Pearson
function $F_\nu$ from the remaining $N^2 - N - 1$ GPDVs,
analogous to (7), the result, after a suitable selection of the
$\lambda_{lk}$ parameters, will again be a complete equivalence between
$F_\nu$ and $E_\nu\{\mathbf{U}\}$. That is, the expressions will be equal to
within a positive scale factor and an additive term independent
of the observer’s decision rule. It remains only to note that
an arbitrary number of such linear constraints can be further
imposed (up to a total of $N^2 - N - 1$, in order to be left with
at least one GPDV) with equivalence continuing to hold. In
the next four subsections, we consider three-class classification
tasks in which three constraints are imposed on the utilities,
leaving a set of three GPDVs (i.e., an ROC surface with two
degrees of freedom in a three-dimensional ROC space).

Before turning to those special cases, however, it is perhaps
worth summarizing the results of the preceding paragraphs.
Briefly, if one imposes particular constraints on the behavior
of an $N$-class ideal observer, the resulting expected utility
for that constrained observer will depend on fewer than
$N^2 - N$ GPDVs. Description of the constrained observer’s
performance in terms of this reduced number of GPDVs is
therefore complete from the point of view of expected utility.
Furthermore, given the mathematical equivalence of $F_\nu$ and
$E_\nu\{\mathbf{U}\}$ just demonstrated, the performance of the observer
which maximizes $F_\nu$ is also completely described by the same
reduced set of GPDVs.

B.  The  Chan  et  al.  Observer 

Chan et al. consider a three-class classification task in which
class $\pi_1$ represents “benign,” class $\pi_2$ “normal,” and class
$\pi_3$ “malignant” observations (e.g., for structures evident in
a medical image) [11]. They simplify the expression in (2) by
restricting all values of utility to lie between $U_{\min}$ and $U_{\max}$;
by setting the “correct decision” utilities $U_{1|1}$, $U_{2|2}$, and $U_{3|3}$
to be $U_{\max}$; the “missed malignancy” utilities $U_{1|3}$ and $U_{2|3}$ to
be $U_{\min}$; and the utilities for incorrect decisions not involving
malignancies $U_{1|2}$ and $U_{2|1}$ to be $U_{\max}$. The remaining
“false-positive” utilities $U_{3|1}$ and $U_{3|2}$ are free to vary in the range
$[U_{\min}, U_{\max}]$. In our notation, this corresponds to imposing the
three constraints $\gamma_{121} = 0$, $\gamma_{212} = 0$, and $\gamma_{313} = \gamma_{323}$. (The
remaining condition $\gamma_{313} = (U_{\max} - U_{\min})P(\pi_3)$ is not an
additional constraint, in the sense of restricting the form of
the observer’s decision rule, but merely determines the scale
of the remaining parameters as explained in the text following
(5).)

Fig. 1. The decision strategy investigated by Chan et al., which is a special
case of the ideal observer decision strategy. Observations in the unlabeled
region are decided “not $\pi_3$,” i.e., either “$\pi_1$” or “$\pi_2$.”

With  these  assumptions,  the  expression  for  expected  utility 
is  reduced  to 

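$$
\begin{aligned}
E\{\mathbf{U}_{\mathrm{Chan}}\} &= U_{\max} - \gamma_{131}P_{31} - \gamma_{232}P_{32} - \gamma_{313}(P_{13} + P_{23}) \\
&= U_{\max} - \gamma_{313} - \gamma_{131}P_{31} - \gamma_{232}P_{32} + \gamma_{313}P_{33}, \qquad (12)
\end{aligned}
$$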

since $P_{13} + P_{23} = 1 - P_{33}$ (again, note that $\gamma_{131}$, $\gamma_{232}$, and
$\gamma_{313}$ are dependent on only two free parameters $U_{3|1}$ and
$U_{3|2}$). As Chan et al. point out [11], this expression depends
on three rather than six GPDVs, namely $P_{31}$, $P_{32}$, and $P_{33}$.
These three rates are used to construct the ROC space in which
they analyze the performance of their observer. That observer
in turn is the special case of the ideal observer obtained by
imposing the above constraints on the decision utilities $U_{i|j}$
or, equivalently, on the parameters $\gamma_{jij}$.

Although we have found it useful to assume the quantities
$\gamma_{jij}$ to be strictly positive, this is not a fundamental requirement,
and Chan et al. indeed allow some of them (e.g., $\gamma_{121}$) to
be zero (consistent with the constraints they place on the $U_{i|j}$
as described above). They obtain the resulting ideal observer
decision lines

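$$0 \cdot \mathrm{LR}_1 - 0 \cdot \mathrm{LR}_2 = 0 \quad \{\text{“1-vs.-2”}\} \qquad (13)$$

$$\gamma_{131}\,\mathrm{LR}_1 + \gamma_{232}\,\mathrm{LR}_2 = \gamma_{313} \quad \{\text{“1-vs.-3”}\} \qquad (14)$$

$$\gamma_{131}\,\mathrm{LR}_1 + \gamma_{232}\,\mathrm{LR}_2 = \gamma_{313} \quad \{\text{“2-vs.-3”}\}, \qquad (15)$$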

which actually correspond to a single line (as the first is undefined
and the remaining two are degenerate). This decision
strategy is illustrated in Fig. 1.

In summary, Chan et al. begin with a three-class ideal
observer model, impose particular constraints on the decision
utilities in that model, and then determine, based on those constraints,
both the resulting form of the special case of the ideal
observer and the conditional classification rates appropriate to
measuring its performance. We now wish to pose a question
from a different point of view: suppose one chooses to measure
arbitrary (i.e., not necessarily ideal) observer performance
only in terms of the conditional classification rates $P_{33}$, $P_{31}$,
and $P_{32}$, ignoring the other rates. For any observer, we can
construct an ROC surface with $P_{33}$ as a function of $P_{31}$ and
$P_{32}$. (For an observer with more than two degrees of freedom
in its decision strategy, one can simply define the surface to be
the maximum value of $P_{33}$ achievable at any given $(P_{31}, P_{32})$
pair.) What observer, if any, will achieve optimal performance
with respect to this surface?

We seek to maximize $P_{33}$ at a particular point
$(P_{31} = \alpha_{31}, P_{32} = \alpha_{32})$ in the domain of the given ROC space.
Another way of stating this is to consider $P_{33}$, $P_{31}$, and
$P_{32}$ as functionals of the observer’s decision rule; we seek
to maximize $P_{33}$ subject to the constraints $P_{31} = \alpha_{31}$ and
$P_{32} = \alpha_{32}$. To find this maximum, we define a function

$$F_{\mathrm{Chan}} = P_{33} - \lambda_{31}(P_{31} - \alpha_{31}) - \lambda_{32}(P_{32} - \alpha_{32}), \qquad (16)$$

where $\lambda_{31}$ and $\lambda_{32}$ are free parameters (the so-called Lagrange
multipliers). Note that maximizing $F_{\mathrm{Chan}}$ at the particular
point $(P_{31} = \alpha_{31}, P_{32} = \alpha_{32})$ is equivalent to maximizing
$P_{33}$ at that point; if the maxima for arbitrary points $(P_{31}, P_{32})$
are achieved by a single decision rule independent of $\alpha_{31}$ and
$\alpha_{32}$, the resulting surface will be the desired optimal surface.

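The functional in (16) is maximized in App. A. The boundary
lines which partition the $(\mathrm{LR}_1, \mathrm{LR}_2)$ decision variable
plane into the regions $Z_1$, $Z_2$, and $Z_3$ are found to be

$$0 \cdot \mathrm{LR}_1 - 0 \cdot \mathrm{LR}_2 = 0 \quad \{\text{“1-vs.-2”}\} \qquad (17)$$

$$\lambda_{31}\,\mathrm{LR}_1 + \lambda_{32}\,\mathrm{LR}_2 = 1 \quad \{\text{“1-vs.-3”}\} \qquad (18)$$

$$\lambda_{31}\,\mathrm{LR}_1 + \lambda_{32}\,\mathrm{LR}_2 = 1 \quad \{\text{“2-vs.-3”}\}. \qquad (19)$$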

If we require $\lambda_{31}$ and $\lambda_{32}$ to be positive, and then define
the quantities $\gamma_{131} = \gamma_{313}\lambda_{31}$ and $\gamma_{232} = \gamma_{313}\lambda_{32}$ for
a positive constant $\gamma_{313}$, the resulting decision strategy is
found to be identical to that stated in (13)-(15). The special
case of the ideal observer proposed by Chan et al., whose
performance depends only on the conditional classification
rates $P_{33}$, $P_{31}$, and $P_{32}$ by (12), is indeed the observer
which obtains optimal performance with respect to this set
of conditional classification rates. By the argument at the end
of Sec. III-A, this description of the constrained observer’s
performance is complete.
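The single-line decision rule of (17)-(19) and its relation to the utilities can be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation used by Chan et al.: the function names and default utility range are hypothetical, “$\pi_3$” is taken to be decided where $\lambda_{31}\mathrm{LR}_1 + \lambda_{32}\mathrm{LR}_2 < 1$ (the region in which the integrand obtained from (16) is positive), and points exactly on the line are assigned to “not $\pi_3$” arbitrarily.

```python
def chan_multipliers(u31, u32, prior, u_min=0.0, u_max=1.0):
    """Return (lam31, lam32) = (gamma_131 / gamma_313, gamma_232 / gamma_313).

    u31 and u32 are the two free "false-positive" utilities U_{3|1}, U_{3|2};
    prior[k] is P(pi_k). All names are assumptions of this sketch.
    """
    g313 = (u_max - u_min) * prior[3]
    g131 = (u_max - u31) * prior[1]
    g232 = (u_max - u32) * prior[2]
    return g131 / g313, g232 / g313

def chan_decision(lr1, lr2, lam31, lam32):
    """Return "3" or "not 3" for an observation with likelihood ratios (lr1, lr2)."""
    if lam31 <= 0 or lam32 <= 0:
        raise ValueError("multipliers are assumed positive")
    return "3" if lam31 * lr1 + lam32 * lr2 < 1.0 else "not 3"
```

Because only two multipliers enter the rule, sweeping `lam31` and `lam32` over the positive reals traces out a surface with two degrees of freedom in the $(P_{31}, P_{32}, P_{33})$ space.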

C. The He et al. Observer

Fig. 2. The decision strategy investigated by He et al., which is a special case
of the ideal observer decision strategy, and which can also be shown to be
a special case of the Scurfield observer in which the decision variables used
are the logarithms of the likelihood ratios $(\mathrm{LR}_1, \mathrm{LR}_2)$ of the observational
data.

He et al. also begin with a three-class ideal observer model
and thus with the expression for expected utility given in (2);
the classification task of interest to them is to distinguish
normal, infarcted, and ischemic tissue based on myocardial
perfusion SPECT [12]. They simplify this expression by
requiring that the utilities of the two possible incorrect
classifications of observations actually from a given class be equal.
That is, $U_{2|1} = U_{3|1}$, $U_{1|2} = U_{3|2}$, and $U_{1|3} = U_{2|3}$. These can
immediately be expressed as the (linear) constraints
$\gamma_{121} = \gamma_{131}$, $\gamma_{212} = \gamma_{232}$, and $\gamma_{313} = \gamma_{323}$. The expression for
expected utility is thereby reduced, in our notation, to

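$$
\begin{aligned}
E\{\mathbf{U}_{\mathrm{He}}\} &= U_{1|1}P(\pi_1) + U_{2|2}P(\pi_2) + U_{3|3}P(\pi_3) \\
&\quad - \gamma_{121}(P_{21} + P_{31}) \\
&\quad - \gamma_{212}(P_{12} + P_{32}) \\
&\quad - \gamma_{313}(P_{13} + P_{23}) \\
&= U_{1|1}P(\pi_1) + U_{2|2}P(\pi_2) + U_{3|3}P(\pi_3) \\
&\quad - (\gamma_{121} + \gamma_{212} + \gamma_{313}) \\
&\quad + \gamma_{121}P_{11} + \gamma_{212}P_{22} + \gamma_{313}P_{33}. \qquad (20)
\end{aligned}
$$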

As He et al. point out [12], this expression depends on only
the three “sensitivities” $P_{11}$, $P_{22}$, and $P_{33}$, rather than six
GPDVs. The three sensitivities are used to construct the ROC
space (equivalent to that proposed by Mossman [10]) in which
they analyze the performance of their observer. That observer
in turn is the special case of the ideal observer obtained by
imposing the above constraints on the decision utilities $U_{i|j}$
or, equivalently, on the parameters $\gamma_{jij}$.

Applying  the  stated  constraints  on  the  utilities  to  the  ideal 
observer  decision  boundary  lines  given  in  (3)-(5)  yields 

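$$\gamma_{121}\,\mathrm{LR}_1 - \gamma_{212}\,\mathrm{LR}_2 = 0 \qquad (21)$$

$$\gamma_{121}\,\mathrm{LR}_1 = \gamma_{313} \qquad (22)$$

$$\gamma_{212}\,\mathrm{LR}_2 = \gamma_{313}. \qquad (23)$$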

This decision strategy is illustrated in Fig. 2. We have recently
shown [13] that this decision strategy is a special case of that
proposed by Scurfield [9] when the decision variables used
by the Scurfield observer are the logarithms of the likelihood
ratios of the observational data.

We now consider evaluating the performance of an arbitrary
observer in the ROC space constructed only from the
observer’s sensitivities (i.e., $P_{11}$, $P_{22}$, and $P_{33}$). Without loss
of generality, we can define such an observer’s ROC surface
as $P_{33}$ considered as a function of $P_{11}$ and $P_{22}$; to find the
optimal observer with respect to this restricted performance
evaluation method, we apply the Neyman-Pearson criterion to
maximize $P_{33}$ subject to the constraints
$(P_{11} = \alpha_{11}, P_{22} = \alpha_{22})$. We define the function

$$F_{\mathrm{He}} = P_{33} + \lambda_{11}(P_{11} - \alpha_{11}) + \lambda_{22}(P_{22} - \alpha_{22}), \qquad (24)$$

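where $\lambda_{11}$ and $\lambda_{22}$ are again the Lagrange multipliers.

The functional in (24) is maximized in App. B. The boundary
lines which partition the $(\mathrm{LR}_1, \mathrm{LR}_2)$ decision variable
plane into the regions $Z_1$, $Z_2$, and $Z_3$ are found to be

$$\lambda_{11}\,\mathrm{LR}_1 - \lambda_{22}\,\mathrm{LR}_2 = 0 \quad \{\text{“1-vs.-2”}\} \qquad (25)$$

$$\lambda_{11}\,\mathrm{LR}_1 = 1 \quad \{\text{“1-vs.-3”}\} \qquad (26)$$

$$\lambda_{22}\,\mathrm{LR}_2 = 1 \quad \{\text{“2-vs.-3”}\}. \qquad (27)$$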

If we require $\lambda_{11}$ and $\lambda_{22}$ to be positive, and define the
quantities $\gamma_{121} = \gamma_{313}\lambda_{11}$ and $\gamma_{212} = \gamma_{313}\lambda_{22}$ for some
arbitrary positive constant $\gamma_{313}$, then the resulting decision
strategy is found to be identical to that stated in (21)-(23). The
special case of the ideal observer proposed by He et al., whose
performance depends only on the conditional classification
rates $P_{11}$, $P_{22}$, and $P_{33}$ by (20), is indeed the observer
which obtains optimal performance with respect to this set
of conditional classification rates. By the argument at the end
of Sec. III-A, this description of the constrained observer’s
performance is complete.
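The partition defined by (25)-(27) amounts to comparing the three quantities $\lambda_{11}\mathrm{LR}_1$, $\lambda_{22}\mathrm{LR}_2$, and 1, which can be sketched in a few lines of Python. The function name and the tie-breaking behavior are assumptions of this sketch; the region assignment assumes, as maximization of (24) suggests, that each observation is assigned to the class with the largest weighted likelihood.

```python
def he_decision(lr1, lr2, lam11, lam22):
    """Return the decided class (1, 2, or 3) for likelihood ratios (lr1, lr2).

    Class 1 is chosen when lam11*LR1 exceeds both lam22*LR2 and 1, class 2
    when lam22*LR2 exceeds both lam11*LR1 and 1, and class 3 otherwise.
    Exact ties go to the lowest-numbered class (an arbitrary choice).
    """
    if lam11 <= 0 or lam22 <= 0:
        raise ValueError("multipliers are assumed positive")
    scores = {1: lam11 * lr1, 2: lam22 * lr2, 3: 1.0}
    return max(scores, key=scores.get)
```

Taking logarithms of the three scores turns the boundaries into lines of unit slope and axis-parallel thresholds in the $(\log \mathrm{LR}_1, \log \mathrm{LR}_2)$ plane, consistent with the relation to the Scurfield observer noted in the text.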

D.  The  Scurfield  Observer  (Likelihood  Ratio) 

In the preceding two sections, we considered decision strategies
that have been proposed by other researchers as special
cases  of  the  three-class  ideal  observer  decision  strategy.  That 
is,  particular  constraints  were  explicitly  imposed  in  the  work 
cited  on  the  decision  utilities  used  by  the  ideal  observer.  The 
remaining  two  decision  strategies  we  consider  in  the  present 
work  are  special  cases  of  a  decision  strategy  proposed  by 
Scurfield  [9]  which  was  not  claimed  to  be  generally  related  to 
the  ideal  observer;  specifically,  Scurfield  specified  the  decision 
boundary  lines  used  by  the  observer,  but  made  no  assumptions 
concerning  the  observer’s  two  decision  variables. 

We showed recently [13] that if particular forms of the
observer’s decision variables related to the likelihood ratios
of the observational data are chosen, then the resulting decision
strategies can be shown to be special cases of the
ideal observer decision strategy. One such special case is the
observer analyzed by He et al. [12], discussed in Sec. III-C,
in which the decision variables used by the Scurfield observer
are the logarithms of the likelihood ratios. Two other such
special cases are the Scurfield observer with the likelihood
ratios themselves as decision variables, which we consider in
this section; and that with the a posteriori class membership
probabilities used as decision variables, considered in Sec. III-E.
A minor difference from the preceding two sections is
that we must determine the implicit constraints on the ideal
observer’s utilities from the known form of the decision rule,
rather than the other way around.

Fig. 3. A special case of the decision strategy investigated by Scurfield, in
which the decision variables used are the likelihood ratios $(\mathrm{LR}_1, \mathrm{LR}_2)$ of
the observational data.

The general Scurfield observer makes decisions by partitioning
a decision variable plane $(y_1, y_2)$ into three regions
via the decision boundary lines

$$y_1 - y_2 = \gamma_1 - \gamma_2 \qquad (28)$$

$$y_1 = \gamma_1 \qquad (29)$$

$$y_2 = \gamma_2, \qquad (30)$$

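where $\gamma_1$ and $\gamma_2$ are parameters upon which the observer’s
performance depends (roughly equivalent to the decision criterion
of a two-class classifier) [9]. When the decision variables are
themselves the likelihood ratios $(\mathrm{LR}_1, \mathrm{LR}_2)$, this becomes in
our notation

$$\gamma_{121}\,\mathrm{LR}_1 - \gamma_{121}\,\mathrm{LR}_2 = \gamma_{313} - \gamma_{323} \qquad (31)$$

$$\gamma_{121}\,\mathrm{LR}_1 = \gamma_{313} \qquad (32)$$

$$\gamma_{121}\,\mathrm{LR}_2 = \gamma_{323}. \qquad (33)$$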

(Compare (28)-(30) with (3)-(5), and note that in order for
the “1-vs.-2” line to have unit slope, it must be the case that
$\gamma_{121} = \gamma_{212}$. Alternatively, after making the assignments
$y_1 = \mathrm{LR}_1$, $y_2 = \mathrm{LR}_2$ in (28)-(30), one is free to multiply all three
equations by a positive constant $\gamma_{121}$.) This decision strategy
is illustrated in Fig. 3.

The relations $\gamma_{121} = \gamma_{131}$ and $\gamma_{212} = \gamma_{232}$ evident from
the above equations immediately give the constraints on the
decision utilities $U_{2|1} = U_{3|1}$ and $U_{1|2} = U_{3|2}$. Furthermore,
the constraint $\gamma_{121} = \gamma_{212}$ implies
$(U_{1|1} - U_{2|1})P(\pi_1) = (U_{2|2} - U_{1|2})P(\pi_2)$. (Recall from Sec. II that
$\gamma_{iji} = (U_{i|i} - U_{j|i})P(\pi_i)$.) This allows us to simplify the expression for
expected utility in (2) to yield

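$$
\begin{aligned}
E\{\mathbf{U}_{\mathrm{Scfd:LR}}\} &= U_{1|1}P(\pi_1) + U_{2|2}P(\pi_2) + U_{3|3}P(\pi_3) \\
&\quad - \gamma_{121}(P_{21} + P_{31}) - \gamma_{121}(P_{12} + P_{32}) \\
&\quad - \gamma_{313}P_{13} - \gamma_{323}P_{23} \\
&= U_{1|1}P(\pi_1) + U_{2|2}P(\pi_2) + U_{3|3}P(\pi_3) \\
&\quad - 2\gamma_{121} + \gamma_{121}(P_{11} + P_{22}) \\
&\quad - \gamma_{313}P_{13} - \gamma_{323}P_{23}. \qquad (34)
\end{aligned}
$$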

This expression for the observer’s expected utility depends on only three GPDVs: $P_{13}$ and $P_{23}$, which are just the misclassification rates for observations actually drawn from class $\pi_3$; and $P_{11} + P_{22}$, which may be regarded as the “total sensitivity” for observations actually drawn from classes $\pi_1$ and $\pi_2$ (ignoring the a priori rates for such observations).

We now consider evaluating the performance of an arbitrary observer in an ROC-like space constructed from the quantities $P_{11} + P_{22}$, $P_{13}$, and $P_{23}$. We will define the ROC-like surface used to evaluate observer performance as the first quantity considered as a function of the two misclassification rates. To find the optimal observer with respect to this restricted performance evaluation method, we apply the Neyman-Pearson criterion to maximize $P_{11} + P_{22}$ subject to the constraints $(P_{13} = \alpha_{13}, P_{23} = \alpha_{23})$. We define the function

$$F_{\mathrm{Scfd:LR}} = P_{11} + P_{22} - \lambda_{13}(P_{13} - \alpha_{13}) - \lambda_{23}(P_{23} - \alpha_{23}), \tag{35}$$

where $\lambda_{13}$ and $\lambda_{23}$ are the Lagrange multipliers.

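The functional in (35) is maximized in App. C. The boundary lines which partition the $(\mathrm{LR}_1, \mathrm{LR}_2)$ decision variable plane into the regions $Z_1$, $Z_2$, and $Z_3$ are found to be

$$\mathrm{LR}_1 - \mathrm{LR}_2 = \lambda_{13} - \lambda_{23} \quad \{\text{“1-vs.-2”}\} \tag{36}$$
$$\mathrm{LR}_1 = \lambda_{13} \quad \{\text{“1-vs.-3”}\} \tag{37}$$
$$\mathrm{LR}_2 = \lambda_{23} \quad \{\text{“2-vs.-3”}\}. \tag{38}$$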

If we require $\lambda_{13}$ and $\lambda_{23}$ to be positive, and define the quantities $\gamma_{313} = \gamma_{121}\lambda_{13}$ and $\gamma_{323} = \gamma_{121}\lambda_{23}$ for some arbitrary positive constant $\gamma_{121}$, then the resulting decision strategy is found to be identical to that stated in (31)-(33). This special case of the observer proposed by Scurfield, which we have shown to be a special case of the ideal observer [13], has a performance that depends only on the GPDVs $P_{11} + P_{22}$, $P_{13}$, and $P_{23}$ by (34). This is indeed the observer which obtains optimal performance with respect to this set of quantities related to the conditional classification rates. By the argument at the end of Sec. III-A, this description of the constrained observer’s performance is complete.
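
The decision strategy bounded by the lines (36)-(38) can be illustrated with a minimal Python sketch. The function name `decide_scurfield_lr` and its argument names are hypothetical, chosen here for illustration only; the region assignments follow the inequalities derived in App. C, and ties on a boundary are resolved arbitrarily (here, in favor of class 2 or class 3 by the order of the tests).

```python
def decide_scurfield_lr(lr1, lr2, lam13, lam23):
    """Return the decided class label (1, 2, or 3) for one observation.

    lr1, lr2     : likelihood ratios LR_1 and LR_2 of the observation
    lam13, lam23 : positive Lagrange multipliers (decision criteria)
    """
    # Region Z_3: both likelihood ratios fall below their criteria.
    if lr1 < lam13 and lr2 < lam23:
        return 3
    # Outside Z_3, the unit-slope "1-vs.-2" line separates Z_1 from Z_2.
    if lr1 - lr2 > lam13 - lam23:
        return 1
    return 2


if __name__ == "__main__":
    for point in [(2.0, 0.5), (0.5, 2.0), (0.5, 0.5)]:
        print(point, "->", decide_scurfield_lr(*point, lam13=1.0, lam23=1.0))
```

Only two criteria appear in the rule, which reflects the reduction from five decision criteria in the general three-class ideal observer.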

E.  The  Scurfield  Observer  (a  posteriori  Class  Probability) 

Equations (28)-(30) in Sec. III-D give the equations for the decision boundary lines of the general Scurfield observer. If we now use two of the a posteriori class membership probabilities, such as $P(\pi_1|\vec{x})$ and $P(\pi_2|\vec{x})$, as the decision variables, the equations become

$$P(\pi_1|\vec{x}) - P(\pi_2|\vec{x}) = \gamma_1 - \gamma_2 \tag{39}$$
$$P(\pi_1|\vec{x}) = \gamma_1 \tag{40}$$
$$P(\pi_2|\vec{x}) = \gamma_2, \tag{41}$$

with $0 < \gamma_1 < 1$ and $0 < \gamma_2 < 1$. (Note that $P(\pi_3|\vec{x}) = 1 - P(\pi_1|\vec{x}) - P(\pi_2|\vec{x})$, meaning this third probability is not needed as an independent decision variable; the particular choice of which two probabilities to use is of course arbitrary.) This decision strategy, which we have shown recently to be a special case of the ideal observer decision strategy [13], is illustrated in Fig. 4.

Fig. 4. A special case of the decision strategy investigated by Scurfield, in which the decision variables used are the a posteriori class membership probabilities $P(\pi_1|\vec{x})$ and $P(\pi_2|\vec{x})$ of the observational data.

We can reexpress the above equations in terms of likelihood ratios by exploiting the relation

$$P(\pi_i|\vec{x}) = \frac{p(\vec{x}|\pi_i)P(\pi_i)}{p(\vec{x})} = \frac{k_i\mathrm{LR}_i}{1 + k_1\mathrm{LR}_1 + k_2\mathrm{LR}_2}, \tag{42}$$

where the second equation is obtained by dividing the numerator and denominator of the first by $p(\vec{x}|\pi_3)P(\pi_3)$, and where $k_i = P(\pi_i)/P(\pi_3)$. The equations for the decision boundary lines become

$$\frac{\frac{P(\pi_1)}{P(\pi_3)}\mathrm{LR}_1 - \frac{P(\pi_2)}{P(\pi_3)}\mathrm{LR}_2}{1 + \frac{P(\pi_1)}{P(\pi_3)}\mathrm{LR}_1 + \frac{P(\pi_2)}{P(\pi_3)}\mathrm{LR}_2} = \gamma_1 - \gamma_2 \tag{43}$$
$$\frac{\frac{P(\pi_1)}{P(\pi_3)}\mathrm{LR}_1}{1 + \frac{P(\pi_1)}{P(\pi_3)}\mathrm{LR}_1 + \frac{P(\pi_2)}{P(\pi_3)}\mathrm{LR}_2} = \gamma_1 \tag{44}$$
$$\frac{\frac{P(\pi_2)}{P(\pi_3)}\mathrm{LR}_2}{1 + \frac{P(\pi_1)}{P(\pi_3)}\mathrm{LR}_1 + \frac{P(\pi_2)}{P(\pi_3)}\mathrm{LR}_2} = \gamma_2, \tag{45}$$

which can in turn be simplified to yield

$$[1 - (\gamma_1 - \gamma_2)]P(\pi_1)\mathrm{LR}_1 - [1 + (\gamma_1 - \gamma_2)]P(\pi_2)\mathrm{LR}_2 = (\gamma_1 - \gamma_2)P(\pi_3) \tag{46}$$
$$(1 - \gamma_1)P(\pi_1)\mathrm{LR}_1 - \gamma_1 P(\pi_2)\mathrm{LR}_2 = \gamma_1 P(\pi_3) \tag{47}$$
$$-\gamma_2 P(\pi_1)\mathrm{LR}_1 + (1 - \gamma_2)P(\pi_2)\mathrm{LR}_2 = \gamma_2 P(\pi_3). \tag{48}$$

Although the above equations for the decision boundary lines are much more complicated than those of the previous three sections, we can still relate the parameters $\gamma_1$ and $\gamma_2$ to the decision rule parameters of (3)-(5) to obtain constraints on them and, consequently, on the utilities $U_{i|j}$. For example, comparison of (47) with (4) gives

$$\gamma_{232} - \gamma_{212} = -\frac{P(\pi_2)}{P(\pi_3)}\gamma_{313}, \tag{49}$$

which can also be expressed in terms of the utilities as $-(U_{1|2} - U_{3|2}) = U_{3|3} - U_{1|3}$. Similarly, comparison of (48) and (5) gives

$$\gamma_{131} - \gamma_{121} = -\frac{P(\pi_1)}{P(\pi_3)}\gamma_{323}, \tag{50}$$

which can be expressed in terms of the utilities as $-(U_{2|1} - U_{3|1}) = U_{3|3} - U_{2|3}$. Finally, we add the first two coefficients of (46) and then compare with (3) to obtain

$$\frac{\gamma_{121}}{P(\pi_1)} - \frac{\gamma_{212}}{P(\pi_2)} = -\frac{2(\gamma_{313} - \gamma_{323})}{P(\pi_3)}, \tag{51}$$

which can be expressed in terms of the utilities as $(U_{1|1} - U_{2|1}) - (U_{2|2} - U_{1|2}) = -2(U_{2|3} - U_{1|3})$. Note that the remaining terms in (46)-(48) involving $\gamma_1$ or $\gamma_2$ are simply differences of terms already considered, and would thus yield no further constraints on the utilities.

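We can now impose constraints (49), (50), and (51) on the general expression (2) for expected utility to obtain the expected utility for this observer:

$$\begin{aligned}
E\{U_{\mathrm{Scfd:AP}}\} &= U_{1|1}P(\pi_1) + U_{2|2}P(\pi_2) + U_{3|3}P(\pi_3) \\
&\quad - \gamma_{121}P_{21} - \left[\frac{P(\pi_2)}{P(\pi_1)}\gamma_{121} + \frac{2P(\pi_2)}{P(\pi_3)}(\gamma_{313} - \gamma_{323})\right]P_{12} \\
&\quad - \left[\gamma_{121} - \frac{P(\pi_1)}{P(\pi_3)}\gamma_{323}\right]P_{31} \\
&\quad - \left[\frac{P(\pi_2)}{P(\pi_1)}\gamma_{121} + \frac{P(\pi_2)}{P(\pi_3)}\gamma_{313} - \frac{2P(\pi_2)}{P(\pi_3)}\gamma_{323}\right]P_{32} \\
&\quad - \gamma_{313}P_{13} - \gamma_{323}P_{23} \\
&= U_{1|1}P(\pi_1) + U_{2|2}P(\pi_2) + U_{3|3}P(\pi_3) \\
&\quad - \frac{\gamma_{121}}{P(\pi_1)}\left[P(\pi_1)(P_{21} + P_{31}) + P(\pi_2)(P_{12} + P_{32})\right] \\
&\quad - \frac{\gamma_{313}}{P(\pi_3)}\left[2P(\pi_2)P_{12} + P(\pi_2)P_{32} + P(\pi_3)P_{13}\right] \\
&\quad - \frac{\gamma_{323}}{P(\pi_3)}\left[-2P(\pi_2)(P_{12} + P_{32}) - P(\pi_1)P_{31} + P(\pi_3)P_{23}\right].
\end{aligned} \tag{52}$$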


This can in turn be simplified slightly using the definition of conditional probability to yield

$$\begin{aligned}
E\{U_{\mathrm{Scfd:AP}}\} &= U_{1|1}P(\pi_1) + U_{2|2}P(\pi_2) + U_{3|3}P(\pi_3) \\
&\quad - \frac{P(\pi_1) + P(\pi_2)}{P(\pi_1)}\gamma_{121} + \frac{2P(\pi_2)}{P(\pi_3)}\gamma_{323} \\
&\quad + \frac{\gamma_{121}}{P(\pi_1)}\left[P(\pi_1)P_{11} + P(\pi_2)P_{22}\right] \\
&\quad - \frac{\gamma_{313}}{P(\pi_3)}\left[2P(\pi_2)P_{12} + P(\pi_2)P_{32} + P(\pi_3)P_{13}\right] \\
&\quad - \frac{\gamma_{323}}{P(\pi_3)}\left[2P(\pi_2)P_{22} - P(\pi_1)P_{31} + P(\pi_3)P_{23}\right].
\end{aligned} \tag{53}$$

As was the case for the decision strategies of the preceding three sections, the expected utility of this observer (and thus its performance, as it too is a special case of the ideal observer) depends on only three GPDVs, namely the quantities in square brackets in (53).

The first GPDV, being a weighted sum of “sensitivities” with positive weights, is immediately seen to be quite suitable for the dependent variable of an ROC surface: a higher value of this quantity is clearly preferable to a lower one. (Indeed, $P(\pi_1)P_{11} + P(\pi_2)P_{22}$ has an intuitive interpretation as the probability of a randomly drawn observation being both (i) from either class $\pi_1$ or $\pi_2$ and also (ii) correctly classified as such. Compare the corresponding quantity $P_{11} + P_{22}$ from Sec. III-D, which is technically not even a probability.) The other two GPDVs in (53) discourage any such straightforward interpretation, but this is perhaps to be expected: the pleasantly symmetric form of the Scurfield decision rule of (28)-(30) in this case holds in the $(P(\pi_1|\vec{x}), P(\pi_2|\vec{x}))$ decision variable plane; due to the complexity of the transformation in (42), this symmetry will be lost in the likelihood ratio decision variable plane, and the expression for expected utility will be correspondingly opaque. (Despite this complexity, it is worth emphasizing that the Scurfield decision rule, for arbitrary choice of the decision variables, has the advantage that it can be proven rigorously that the volume under any of the conventional ROC surfaces proposed by Scurfield is equal to the probability of a particular outcome of a three-alternative forced choice experiment [9]. Although it is possible that a different choice of decision rule would yield a more “intuitive” triple of GPDVs than that given in (53), we have considered it worthwhile to investigate the consequences of the Scurfield decision rule for three very natural choices of decision variable, namely the log-likelihood ratios investigated by He; the likelihood ratios themselves; and the a posteriori class membership probabilities.)

In any case, we now consider evaluating the performance of an arbitrary observer in an ROC-like space constructed from the quantities $P(\pi_1)P_{11} + P(\pi_2)P_{22}$, $2P(\pi_2)P_{12} + P(\pi_2)P_{32} + P(\pi_3)P_{13}$, and $2P(\pi_2)P_{22} - P(\pi_1)P_{31} + P(\pi_3)P_{23}$. We will define the ROC-like surface used to evaluate observer performance as the first quantity considered as a function of the other two. To find the optimal observer with respect to this restricted performance evaluation method, we apply the Neyman-Pearson criterion to maximize $P(\pi_1)P_{11} + P(\pi_2)P_{22}$ subject to the constraints $2P(\pi_2)P_{12} + P(\pi_2)P_{32} + P(\pi_3)P_{13} = \alpha_1$ and $2P(\pi_2)P_{22} - P(\pi_1)P_{31} + P(\pi_3)P_{23} = \alpha_2$. We define the function


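$$\begin{aligned}
F_{\mathrm{Scfd:AP}} &= P(\pi_1)P_{11} + P(\pi_2)P_{22} \\
&\quad - \lambda_1\left[2P(\pi_2)P_{12} + P(\pi_2)P_{32} + P(\pi_3)P_{13} - \alpha_1\right] \\
&\quad - \lambda_2\left[2P(\pi_2)P_{22} - P(\pi_1)P_{31} + P(\pi_3)P_{23} - \alpha_2\right],
\end{aligned} \tag{54}$$

where $\lambda_1$ and $\lambda_2$ are the Lagrange multipliers.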

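The functional in (54) is maximized in App. D. The boundary lines which partition the $(\mathrm{LR}_1, \mathrm{LR}_2)$ decision variable plane into the regions $Z_1$, $Z_2$, and $Z_3$ are found to be

$$\frac{P(\pi_1)}{P(\pi_3)}\mathrm{LR}_1 - (2\lambda_1 - 2\lambda_2 + 1)\frac{P(\pi_2)}{P(\pi_3)}\mathrm{LR}_2 = \lambda_1 - \lambda_2 \tag{55}$$
$$(1 - \lambda_2)\frac{P(\pi_1)}{P(\pi_3)}\mathrm{LR}_1 - \lambda_1\frac{P(\pi_2)}{P(\pi_3)}\mathrm{LR}_2 = \lambda_1 \tag{56}$$
$$-\lambda_2\frac{P(\pi_1)}{P(\pi_3)}\mathrm{LR}_1 + (\lambda_1 - 2\lambda_2 + 1)\frac{P(\pi_2)}{P(\pi_3)}\mathrm{LR}_2 = \lambda_2. \tag{57}$$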


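If we define the quantities $\gamma_{313} = \gamma_{121}[P(\pi_3)/P(\pi_1)]\lambda_1$ and $\gamma_{323} = \gamma_{121}[P(\pi_3)/P(\pi_1)]\lambda_2$, and further require $\lambda_1$ and $\lambda_2$ to be positive, then the resulting decision strategy is found to be

$$\gamma_{121}\mathrm{LR}_1 - \left[\frac{2\gamma_{313}}{P(\pi_3)} - \frac{2\gamma_{323}}{P(\pi_3)} + \frac{\gamma_{121}}{P(\pi_1)}\right]P(\pi_2)\mathrm{LR}_2 = \gamma_{313} - \gamma_{323} \tag{58}$$
$$\left[\gamma_{121} - \gamma_{323}\frac{P(\pi_1)}{P(\pi_3)}\right]\mathrm{LR}_1 - \frac{P(\pi_2)}{P(\pi_3)}\gamma_{313}\mathrm{LR}_2 = \gamma_{313} \tag{59}$$
$$-\frac{P(\pi_1)}{P(\pi_3)}\gamma_{323}\mathrm{LR}_1 + \left[\frac{\gamma_{313}}{P(\pi_3)} - \frac{2\gamma_{323}}{P(\pi_3)} + \frac{\gamma_{121}}{P(\pi_1)}\right]P(\pi_2)\mathrm{LR}_2 = \gamma_{323}. \tag{60}$$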


This is in fact the ideal observer subject to the constraints in (49)-(51); that is, the resulting observer is identical to that stated in (39)-(41). This special case of the observer proposed by Scurfield, which we have shown to be a special case of the ideal observer [13], has a performance that depends only on the quantities $P(\pi_1)P_{11} + P(\pi_2)P_{22}$, $2P(\pi_2)P_{12} + P(\pi_2)P_{32} + P(\pi_3)P_{13}$, and $2P(\pi_2)P_{22} - P(\pi_1)P_{31} + P(\pi_3)P_{23}$ by (53). The observer described above is indeed that which obtains optimal performance with respect to this set of quantities related to the conditional classification rates. By the argument at the end of Sec. III-A, this description of the constrained observer’s performance is complete.
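
As an illustration of (39)-(42), the following Python sketch converts a pair of likelihood ratios into a posteriori class probabilities and then applies the Scurfield boundaries in the posterior plane. The function names, the argument layout, and the convention that class 1 lies on the side where $P(\pi_1|\vec{x}) - P(\pi_2|\vec{x})$ exceeds $\gamma_1 - \gamma_2$ are assumptions made here by analogy with Sec. III-D; they are not taken verbatim from the text.

```python
def posteriors_from_lrs(lr1, lr2, prior1, prior2, prior3):
    """Posterior probabilities of classes 1 and 2, following (42)."""
    k1 = prior1 / prior3
    k2 = prior2 / prior3
    denom = 1.0 + k1 * lr1 + k2 * lr2
    return k1 * lr1 / denom, k2 * lr2 / denom


def decide_scurfield_ap(lr1, lr2, priors, gamma1, gamma2):
    """Return 1, 2, or 3 using the boundaries (39)-(41).

    priors         : tuple (P(pi_1), P(pi_2), P(pi_3))
    gamma1, gamma2 : decision criteria, each strictly between 0 and 1
    """
    p1, p2 = posteriors_from_lrs(lr1, lr2, *priors)
    # Region Z_3: neither posterior reaches its criterion.
    if p1 < gamma1 and p2 < gamma2:
        return 3
    # Otherwise the "1-vs.-2" line decides between classes 1 and 2.
    if p1 - p2 > gamma1 - gamma2:
        return 1
    return 2


if __name__ == "__main__":
    equal_priors = (1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0)
    print(decide_scurfield_ap(4.0, 1.0, equal_priors, 0.4, 0.4))
    print(decide_scurfield_ap(0.1, 0.1, equal_priors, 0.4, 0.4))
```

The rule is simple in the posterior plane; the complexity of (46)-(48) arises only when it is rewritten in terms of the likelihood ratios.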


IV.  Discussion 

Given the rapid increase in complexity of the utility constraints and performance evaluation criteria as one proceeds from Secs. III-B to III-E, it is quite possible for the main point of the above analyses to become obscured. That main point is that, for each of a variety of constrained special cases of the
three-class  ideal  observer,  the  performance  of  that  observer 
is  completely  describable,  in  an  expected-utility  sense,  by 
only  two  decision  criteria  and  three  quantities  related  to 
conditional  classification  rates.  This  represents  a  considerable 
simplification  from  the  general  model,  which  is  known  to 
involve  five  decision  criteria  and  six  conditional  classification 
rates.  Furthermore,  given  the  result  derived  in  Sec.  III-A,  this 
conclusion  can  be  seen  to  apply  to  any  set  of  GPDVs  obtained 
from linear constraints on the ideal observer’s decision utilities, and not merely the four special cases considered explicitly here. Put another way, it is relatively straightforward to see that if linear restrictions (i.e., constraints) of the form described in Sec. III-A are placed on the ideal observer decision rule, the performance of the resulting observer will be describable with fewer than five degrees of freedom (or, in general, fewer than $N^2 - N - 1$ in an $N$-class classification task). Moreover, we have shown the converse to be true as well for a wide variety of restricted performance evaluation models: if one chooses to describe observer performance with fewer than six (or, in general, fewer than $N^2 - N$) GPDVs that are linearly related to the conditional classification probabilities, then the observer
which  optimizes  performance  with  respect  to  that  description 
is  a  restricted  form  of  the  ideal  observer  (where  the  restrictions 
correspond  to  linear  constraints  on  the  utilities).  Again,  this 
follows  directly  from  the  proof  in  Sec.  III-A  that  the  expected 
utility  and  Neyman-Pearson  optimization  methods  are  in  fact 
mathematically  equivalent. 

It  should  be  immediately  acknowledged  that  such  simplified 
models may ultimately prove to be of limited practical importance. Given an observer known to closely approximate the
behavior  of  the  unrestricted  ideal  observer,  or  indeed  given 
a  human  observer,  it  is  difficult  to  conceive  of  a  pragmatic 
way  to  externally  constrain  the  observer’s  decision  utilities 
to  match  a  particular  model  such  as  one  of  those  described 
above.  On  the  other  hand,  an  algorithmic  observer  (such  as 
an  implementation  of  a  computerized  scheme  for  computer- 
aided  diagnosis)  might  readily  allow  such  constraints  on  its 
decision  rules  to  be  implemented;  however,  the  assumption 
that  the  probability  density  functions  of  the  decision  variables 
generated  by  the  scheme  do  indeed  follow  those  required 
by  the  ideal  observer  model  would  generally  be  unverifiable, 
given  the  limited  amount  of  data  typically  available  for  training 
and  testing  such  a  scheme. 

V.  Conclusions 

Despite the limitations of constrained or simplified performance evaluation models stated in the preceding section, it
remains  an  acknowledged  fact  that  a  fully  general  extension 
of  ROC  analysis  to  classification  tasks  with  three  or  more 
classes  has  yet  to  be  developed.  Although  the  investigation 
of  constrained  and  therefore  tractable  observer  performance 
evaluation  models  should  not  be  considered  an  end  unto  itself, 
a  thorough  understanding  of  such  models  is  almost  certain  to 
prove  necessary  for  the  development  of  more  general  observer 
models.  We  believe  that  demonstrating  particular  constrained 
ideal  observer  models  to  be  complete  as  well  as  tractable  will 
be  a  crucial  step  toward  this  understanding. 


Appendix  A 

The  Chan  et  al.  Observer 

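As stated in the material leading up to (3)-(5), observer decisions here are assumed to be made based on statistically variable observational data. Explicitly,

$$P_{ij} = \int_{Z_i} p(\vec{x}|\pi_j)\, d^m\vec{x}, \tag{61}$$

where $Z_i$ is the region for which observations $\vec{x}$ (of dimension $m$) are decided to belong to the class labeled $\pi_i$ ($1 \le i \le 3$). The expression for $F_{\mathrm{Chan}}$ in (16) can then be written as follows:

$$\begin{aligned}
F_{\mathrm{Chan}} &= 1 - P_{13} - P_{23} - \lambda_{31}P_{31} + \lambda_{31}\alpha_{31} - \lambda_{32}P_{32} + \lambda_{32}\alpha_{32} \\
&= 1 + \lambda_{31}\alpha_{31} + \lambda_{32}\alpha_{32} - \{P_{13} + P_{23} + \lambda_{31}P_{31} + \lambda_{32}P_{32}\} \\
&= 1 + \lambda_{31}\alpha_{31} + \lambda_{32}\alpha_{32} \\
&\quad - \left\{\int_{Z_1} p(\vec{x}|\pi_3)\, d^m\vec{x} + \int_{Z_2} p(\vec{x}|\pi_3)\, d^m\vec{x} + \int_{Z_3} \left[\lambda_{31}p(\vec{x}|\pi_1) + \lambda_{32}p(\vec{x}|\pi_2)\right] d^m\vec{x}\right\}.
\end{aligned} \tag{62}$$

$F_{\mathrm{Chan}}$ is maximized when the quantity in braces is minimized. This quantity, in turn, can be minimized by assigning a given $\vec{x}$ to the region $Z_i$ such that the $i$th integrand (from among the integrals in braces in (62)) is minimal. (Situations in which two or more of the integrands yield the same minimal value for a given $\vec{x}$ can be decided in an arbitrary but consistent fashion.)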

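That is,

$$\begin{aligned}
\text{decide } \pi_1 \text{ iff}\quad & p(\vec{x}|\pi_3) < p(\vec{x}|\pi_3) \\
\text{and}\quad & p(\vec{x}|\pi_3) < \lambda_{31}p(\vec{x}|\pi_1) + \lambda_{32}p(\vec{x}|\pi_2)
\end{aligned} \tag{63}$$
$$\begin{aligned}
\text{decide } \pi_2 \text{ iff}\quad & p(\vec{x}|\pi_3) < p(\vec{x}|\pi_3) \\
\text{and}\quad & p(\vec{x}|\pi_3) < \lambda_{31}p(\vec{x}|\pi_1) + \lambda_{32}p(\vec{x}|\pi_2)
\end{aligned} \tag{64}$$
$$\begin{aligned}
\text{decide } \pi_3 \text{ iff}\quad & p(\vec{x}|\pi_3) > \lambda_{31}p(\vec{x}|\pi_1) + \lambda_{32}p(\vec{x}|\pi_2) \\
\text{and}\quad & p(\vec{x}|\pi_3) > \lambda_{31}p(\vec{x}|\pi_1) + \lambda_{32}p(\vec{x}|\pi_2).
\end{aligned} \tag{65}$$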


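We can divide these relations by $p(\vec{x}|\pi_3)$ to obtain

$$\begin{aligned}
\text{decide } \pi_1 \text{ iff}\quad & 0\,\mathrm{LR}_1 - 0\,\mathrm{LR}_2 > 0 \\
\text{and}\quad & \lambda_{31}\mathrm{LR}_1 + \lambda_{32}\mathrm{LR}_2 > 1
\end{aligned} \tag{66}$$
$$\begin{aligned}
\text{decide } \pi_2 \text{ iff}\quad & 0\,\mathrm{LR}_1 - 0\,\mathrm{LR}_2 < 0 \\
\text{and}\quad & \lambda_{31}\mathrm{LR}_1 + \lambda_{32}\mathrm{LR}_2 > 1
\end{aligned} \tag{67}$$
$$\begin{aligned}
\text{decide } \pi_3 \text{ iff}\quad & \lambda_{31}\mathrm{LR}_1 + \lambda_{32}\mathrm{LR}_2 < 1 \\
\text{and}\quad & \lambda_{31}\mathrm{LR}_1 + \lambda_{32}\mathrm{LR}_2 < 1.
\end{aligned} \tag{68}$$

(We assume without loss of generality that $p(\vec{x}|\pi_3) > 0$, because the task reduces to a two-class problem for values of $\vec{x}$ such that $p(\vec{x}|\pi_3) = 0$.) The corresponding decision boundary lines are given in (17)-(19).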
Appendix B

The  He  et  al.  Observer 

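Using (61), the expression for $F_{\mathrm{He}}$ in (24) can be expressed as

$$\begin{aligned}
F_{\mathrm{He}} &= 1 - P_{13} - P_{23} + \lambda_{11}(1 - P_{21} - P_{31}) - \lambda_{11}\alpha_{11} + \lambda_{22}(1 - P_{12} - P_{32}) - \lambda_{22}\alpha_{22} \\
&= 1 + \lambda_{11}(1 - \alpha_{11}) + \lambda_{22}(1 - \alpha_{22}) - \{P_{13} + P_{23} + \lambda_{11}(P_{21} + P_{31}) + \lambda_{22}(P_{12} + P_{32})\} \\
&= 1 + \lambda_{11}(1 - \alpha_{11}) + \lambda_{22}(1 - \alpha_{22}) \\
&\quad - \left\{\int_{Z_1} \left[\lambda_{22}p(\vec{x}|\pi_2) + p(\vec{x}|\pi_3)\right] d^m\vec{x} + \int_{Z_2} \left[\lambda_{11}p(\vec{x}|\pi_1) + p(\vec{x}|\pi_3)\right] d^m\vec{x}\right. \\
&\qquad \left. + \int_{Z_3} \left[\lambda_{11}p(\vec{x}|\pi_1) + \lambda_{22}p(\vec{x}|\pi_2)\right] d^m\vec{x}\right\}.
\end{aligned} \tag{69}$$

$F_{\mathrm{He}}$ is maximized when the quantity in braces is minimized. This quantity, in turn, can be minimized by assigning a given $\vec{x}$ to the region $Z_i$ such that the $i$th integrand (from among the integrals in braces in (69)) is minimal. (Situations in which two or more of the integrands yield the same minimal value for a given $\vec{x}$ can be decided in an arbitrary but consistent fashion.)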

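That is,

$$\begin{aligned}
\text{decide } \pi_1 \text{ iff}\quad & \lambda_{22}p(\vec{x}|\pi_2) < \lambda_{11}p(\vec{x}|\pi_1) \\
\text{and}\quad & p(\vec{x}|\pi_3) < \lambda_{11}p(\vec{x}|\pi_1)
\end{aligned} \tag{70}$$
$$\begin{aligned}
\text{decide } \pi_2 \text{ iff}\quad & \lambda_{11}p(\vec{x}|\pi_1) < \lambda_{22}p(\vec{x}|\pi_2) \\
\text{and}\quad & p(\vec{x}|\pi_3) < \lambda_{22}p(\vec{x}|\pi_2)
\end{aligned} \tag{71}$$
$$\begin{aligned}
\text{decide } \pi_3 \text{ iff}\quad & \lambda_{11}p(\vec{x}|\pi_1) < p(\vec{x}|\pi_3) \\
\text{and}\quad & \lambda_{22}p(\vec{x}|\pi_2) < p(\vec{x}|\pi_3).
\end{aligned} \tag{72}$$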

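We can divide these relations by $p(\vec{x}|\pi_3)$ to obtain

$$\begin{aligned}
\text{decide } \pi_1 \text{ iff}\quad & \lambda_{11}\mathrm{LR}_1 - \lambda_{22}\mathrm{LR}_2 > 0 \\
\text{and}\quad & \lambda_{11}\mathrm{LR}_1 > 1
\end{aligned} \tag{73}$$
$$\begin{aligned}
\text{decide } \pi_2 \text{ iff}\quad & \lambda_{11}\mathrm{LR}_1 - \lambda_{22}\mathrm{LR}_2 < 0 \\
\text{and}\quad & \lambda_{22}\mathrm{LR}_2 > 1
\end{aligned} \tag{74}$$
$$\begin{aligned}
\text{decide } \pi_3 \text{ iff}\quad & \lambda_{11}\mathrm{LR}_1 < 1 \\
\text{and}\quad & \lambda_{22}\mathrm{LR}_2 < 1.
\end{aligned} \tag{75}$$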

The  corresponding  decision  boundary  lines  are  given  in  (25)- 
(27). 
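
The "assign each observation to the region with the minimal integrand" argument used above can be sketched directly in Python. The sketch below divides the three integrands of (69) by $p(\vec{x}|\pi_3)$, which is assumed positive, so that only the likelihood ratios are needed; the function name `decide_he` is hypothetical, and ties are broken in favor of the lowest class label, one arbitrary but consistent choice.

```python
def decide_he(lr1, lr2, lam11, lam22):
    """Return 1, 2, or 3 by minimizing the integrands in braces in (69).

    Each integrand has been divided by p(x | pi_3) > 0, so it is
    expressed in terms of the likelihood ratios LR_1 and LR_2.
    """
    integrands = {
        1: lam22 * lr2 + 1.0,          # integrand over Z_1
        2: lam11 * lr1 + 1.0,          # integrand over Z_2
        3: lam11 * lr1 + lam22 * lr2,  # integrand over Z_3
    }
    # min() returns the first minimal key in insertion order, so a tie
    # is resolved in favor of the lowest class label.
    return min(integrands, key=integrands.get)


if __name__ == "__main__":
    print(decide_he(3.0, 1.0, 1.0, 1.0))
    print(decide_he(1.0, 3.0, 1.0, 1.0))
    print(decide_he(0.2, 0.3, 1.0, 1.0))
```

Away from ties, comparing the integrands pairwise reproduces the inequalities (73)-(75).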


Appendix  C 

The  Scurfield  Observer  (Likelihood  Ratio) 

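Using (61), the expression for $F_{\mathrm{Scfd:LR}}$ in (35) can be written as

$$\begin{aligned}
F_{\mathrm{Scfd:LR}} &= 1 - P_{21} - P_{31} + 1 - P_{12} - P_{32} - \lambda_{13}P_{13} + \lambda_{13}\alpha_{13} - \lambda_{23}P_{23} + \lambda_{23}\alpha_{23} \\
&= 2 + \lambda_{13}\alpha_{13} + \lambda_{23}\alpha_{23} - \{P_{21} + P_{31} + P_{12} + P_{32} + \lambda_{13}P_{13} + \lambda_{23}P_{23}\} \\
&= 2 + \lambda_{13}\alpha_{13} + \lambda_{23}\alpha_{23} \\
&\quad - \left\{\int_{Z_1} \left[p(\vec{x}|\pi_2) + \lambda_{13}p(\vec{x}|\pi_3)\right] d^m\vec{x} + \int_{Z_2} \left[p(\vec{x}|\pi_1) + \lambda_{23}p(\vec{x}|\pi_3)\right] d^m\vec{x}\right. \\
&\qquad \left. + \int_{Z_3} \left[p(\vec{x}|\pi_1) + p(\vec{x}|\pi_2)\right] d^m\vec{x}\right\}.
\end{aligned} \tag{76}$$

$F_{\mathrm{Scfd:LR}}$ is maximized when the quantity in braces is minimized. This quantity, in turn, can be minimized by assigning a given $\vec{x}$ to the region $Z_i$ such that the $i$th integrand (from among the integrals in braces in (76)) is minimal. (Situations in which two or more of the integrands yield the same minimal value for a given $\vec{x}$ can be decided in an arbitrary but consistent fashion.)

That is,

$$\begin{aligned}
\text{decide } \pi_1 \text{ iff}\quad & p(\vec{x}|\pi_2) + \lambda_{13}p(\vec{x}|\pi_3) < p(\vec{x}|\pi_1) + \lambda_{23}p(\vec{x}|\pi_3) \\
\text{and}\quad & \lambda_{13}p(\vec{x}|\pi_3) < p(\vec{x}|\pi_1)
\end{aligned} \tag{77}$$
$$\begin{aligned}
\text{decide } \pi_2 \text{ iff}\quad & p(\vec{x}|\pi_1) + \lambda_{23}p(\vec{x}|\pi_3) < p(\vec{x}|\pi_2) + \lambda_{13}p(\vec{x}|\pi_3) \\
\text{and}\quad & \lambda_{23}p(\vec{x}|\pi_3) < p(\vec{x}|\pi_2)
\end{aligned} \tag{78}$$
$$\begin{aligned}
\text{decide } \pi_3 \text{ iff}\quad & p(\vec{x}|\pi_1) < \lambda_{13}p(\vec{x}|\pi_3) \\
\text{and}\quad & p(\vec{x}|\pi_2) < \lambda_{23}p(\vec{x}|\pi_3).
\end{aligned} \tag{79}$$

We can divide these relations by $p(\vec{x}|\pi_3)$ to obtain

$$\begin{aligned}
\text{decide } \pi_1 \text{ iff}\quad & \mathrm{LR}_1 - \mathrm{LR}_2 > \lambda_{13} - \lambda_{23} \\
\text{and}\quad & \mathrm{LR}_1 > \lambda_{13}
\end{aligned} \tag{80}$$
$$\begin{aligned}
\text{decide } \pi_2 \text{ iff}\quad & \mathrm{LR}_1 - \mathrm{LR}_2 < \lambda_{13} - \lambda_{23} \\
\text{and}\quad & \mathrm{LR}_2 > \lambda_{23}
\end{aligned} \tag{81}$$
$$\begin{aligned}
\text{decide } \pi_3 \text{ iff}\quad & \mathrm{LR}_1 < \lambda_{13} \\
\text{and}\quad & \mathrm{LR}_2 < \lambda_{23}.
\end{aligned} \tag{82}$$

The corresponding decision boundary lines are given in (36)-(38).


Appendix D

The Scurfield Observer (a posteriori Class Probability)

Using (61), the expression for $F_{\mathrm{Scfd:AP}}$ in (54) can be written as

$$\begin{aligned}
F_{\mathrm{Scfd:AP}} &= \lambda_1\alpha_1 + \lambda_2\alpha_2 + P(\pi_1)\int_{Z_1} p(\vec{x}|\pi_1)\, d^m\vec{x} + P(\pi_2)\int_{Z_2} p(\vec{x}|\pi_2)\, d^m\vec{x} \\
&\quad - \lambda_1\left[2P(\pi_2)\int_{Z_1} p(\vec{x}|\pi_2)\, d^m\vec{x} + P(\pi_2)\int_{Z_3} p(\vec{x}|\pi_2)\, d^m\vec{x} + P(\pi_3)\int_{Z_1} p(\vec{x}|\pi_3)\, d^m\vec{x}\right] \\
&\quad - \lambda_2\left[2P(\pi_2)\int_{Z_2} p(\vec{x}|\pi_2)\, d^m\vec{x} - P(\pi_1)\int_{Z_3} p(\vec{x}|\pi_1)\, d^m\vec{x} + P(\pi_3)\int_{Z_2} p(\vec{x}|\pi_3)\, d^m\vec{x}\right].
\end{aligned} \tag{83}$$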


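Collecting terms with given domains of integration yields

$$\begin{aligned}
F_{\mathrm{Scfd:AP}} &= \lambda_1\alpha_1 + \lambda_2\alpha_2 \\
&\quad + \int_{Z_1} \left[P(\pi_1)p(\vec{x}|\pi_1) - 2\lambda_1 P(\pi_2)p(\vec{x}|\pi_2) - \lambda_1 P(\pi_3)p(\vec{x}|\pi_3)\right] d^m\vec{x} \\
&\quad + \int_{Z_2} \left[P(\pi_2)p(\vec{x}|\pi_2) - 2\lambda_2 P(\pi_2)p(\vec{x}|\pi_2) - \lambda_2 P(\pi_3)p(\vec{x}|\pi_3)\right] d^m\vec{x} \\
&\quad + \int_{Z_3} \left[-\lambda_1 P(\pi_2)p(\vec{x}|\pi_2) + \lambda_2 P(\pi_1)p(\vec{x}|\pi_1)\right] d^m\vec{x}.
\end{aligned} \tag{84}$$

$F_{\mathrm{Scfd:AP}}$ can be maximized by assigning a given $\vec{x}$ to the region $Z_i$ such that the integrand over $Z_i$ in (84) is maximal. (Situations in which two or more of the integrands yield the same maximal value for a given $\vec{x}$ can be decided in an arbitrary but consistent fashion.)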

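That is,

$$\begin{aligned}
\text{decide } \pi_1 \text{ iff}\quad & P(\pi_1)p(\vec{x}|\pi_1) - 2\lambda_1 P(\pi_2)p(\vec{x}|\pi_2) - \lambda_1 P(\pi_3)p(\vec{x}|\pi_3) \\
& > P(\pi_2)p(\vec{x}|\pi_2) - 2\lambda_2 P(\pi_2)p(\vec{x}|\pi_2) - \lambda_2 P(\pi_3)p(\vec{x}|\pi_3) \\
\text{and}\quad & P(\pi_1)p(\vec{x}|\pi_1) - 2\lambda_1 P(\pi_2)p(\vec{x}|\pi_2) - \lambda_1 P(\pi_3)p(\vec{x}|\pi_3) \\
& > -\lambda_1 P(\pi_2)p(\vec{x}|\pi_2) + \lambda_2 P(\pi_1)p(\vec{x}|\pi_1)
\end{aligned} \tag{85}$$
$$\begin{aligned}
\text{decide } \pi_2 \text{ iff}\quad & P(\pi_2)p(\vec{x}|\pi_2) - 2\lambda_2 P(\pi_2)p(\vec{x}|\pi_2) - \lambda_2 P(\pi_3)p(\vec{x}|\pi_3) \\
& > P(\pi_1)p(\vec{x}|\pi_1) - 2\lambda_1 P(\pi_2)p(\vec{x}|\pi_2) - \lambda_1 P(\pi_3)p(\vec{x}|\pi_3) \\
\text{and}\quad & P(\pi_2)p(\vec{x}|\pi_2) - 2\lambda_2 P(\pi_2)p(\vec{x}|\pi_2) - \lambda_2 P(\pi_3)p(\vec{x}|\pi_3) \\
& > -\lambda_1 P(\pi_2)p(\vec{x}|\pi_2) + \lambda_2 P(\pi_1)p(\vec{x}|\pi_1)
\end{aligned} \tag{86}$$
$$\begin{aligned}
\text{decide } \pi_3 \text{ iff}\quad & P(\pi_1)p(\vec{x}|\pi_1) - 2\lambda_1 P(\pi_2)p(\vec{x}|\pi_2) - \lambda_1 P(\pi_3)p(\vec{x}|\pi_3) \\
& < -\lambda_1 P(\pi_2)p(\vec{x}|\pi_2) + \lambda_2 P(\pi_1)p(\vec{x}|\pi_1) \\
\text{and}\quad & P(\pi_2)p(\vec{x}|\pi_2) - 2\lambda_2 P(\pi_2)p(\vec{x}|\pi_2) - \lambda_2 P(\pi_3)p(\vec{x}|\pi_3) \\
& < -\lambda_1 P(\pi_2)p(\vec{x}|\pi_2) + \lambda_2 P(\pi_1)p(\vec{x}|\pi_1).
\end{aligned} \tag{87}$$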

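We can divide these relations by $p(\vec{x}|\pi_3)$ and rearrange terms to obtain

$$\begin{aligned}
\text{decide } \pi_1 \text{ iff}\quad & P(\pi_1)\mathrm{LR}_1 - (2\lambda_1 - 2\lambda_2 + 1)P(\pi_2)\mathrm{LR}_2 > (\lambda_1 - \lambda_2)P(\pi_3) \\
\text{and}\quad & (1 - \lambda_2)P(\pi_1)\mathrm{LR}_1 - \lambda_1 P(\pi_2)\mathrm{LR}_2 > \lambda_1 P(\pi_3)
\end{aligned} \tag{88}$$
$$\begin{aligned}
\text{decide } \pi_2 \text{ iff}\quad & P(\pi_1)\mathrm{LR}_1 - (2\lambda_1 - 2\lambda_2 + 1)P(\pi_2)\mathrm{LR}_2 < (\lambda_1 - \lambda_2)P(\pi_3) \\
\text{and}\quad & -\lambda_2 P(\pi_1)\mathrm{LR}_1 + (\lambda_1 - 2\lambda_2 + 1)P(\pi_2)\mathrm{LR}_2 > \lambda_2 P(\pi_3)
\end{aligned} \tag{89}$$
$$\begin{aligned}
\text{decide } \pi_3 \text{ iff}\quad & (1 - \lambda_2)P(\pi_1)\mathrm{LR}_1 - \lambda_1 P(\pi_2)\mathrm{LR}_2 < \lambda_1 P(\pi_3) \\
\text{and}\quad & -\lambda_2 P(\pi_1)\mathrm{LR}_1 + (\lambda_1 - 2\lambda_2 + 1)P(\pi_2)\mathrm{LR}_2 < \lambda_2 P(\pi_3).
\end{aligned} \tag{90}$$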

The  corresponding  decision  boundary  lines  are  given  in  (55)- 
(57). 
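
The maximization step of this appendix can be checked numerically with a short Python sketch that evaluates the three integrands of (84), each divided by $p(\vec{x}|\pi_3)$ (assumed positive), and picks the largest. The function name, the argument layout, and the example priors and multipliers are illustrative assumptions, not values from the text.

```python
def decide_scurfield_ap_lagrange(lr1, lr2, priors, lam1, lam2):
    """Return 1, 2, or 3 by maximizing the integrands of (84).

    priors     : tuple (P(pi_1), P(pi_2), P(pi_3))
    lam1, lam2 : positive Lagrange multipliers
    """
    pr1, pr2, pr3 = priors
    integrands = {
        1: pr1 * lr1 - 2.0 * lam1 * pr2 * lr2 - lam1 * pr3,  # over Z_1
        2: (1.0 - 2.0 * lam2) * pr2 * lr2 - lam2 * pr3,      # over Z_2
        3: -lam1 * pr2 * lr2 + lam2 * pr1 * lr1,             # over Z_3
    }
    # Ties are broken in favor of the lowest label (arbitrary but consistent).
    return max(integrands, key=integrands.get)


if __name__ == "__main__":
    priors = (0.25, 0.25, 0.5)
    for lrs in [(4.0, 1.0), (1.0, 4.0), (0.125, 0.125)]:
        print(lrs, "->", decide_scurfield_ap_lagrange(*lrs, priors, 0.5, 0.5))
```

Comparing the integrands pairwise gives the inequalities (88)-(90), so away from ties the sketch and those inequalities select the same class.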
ABSTRACT

We  have  shown  previously  that  an  obvious  generalization  of  the  area  under  an  ROC  curve  (AUC)  cannot  serve 
as  a  useful  performance  metric  in  classification  tasks  with  more  than  two  classes.  We  define  a  new  performance 
metric,  grounded  in  the  concept  of  expected  utility  familiar  from  ideal  observer  decision  theory,  but  which 
should  not  suffer  from  the  issues  of  dimensionality  and  degeneracy  inherent  in  the  hypervolume  under  the  ROC 
hypersurface  in  tasks  with  more  than  two  classes.  In  the  present  work,  we  compare  this  performance  metric 
with  the  traditional  AUC  metric  in  a  variety  of  two-class  tasks.  Our  numerical  studies  suggest  that  the  behavior 
of  the  proposed  performance  metric  is  consistent  with  that  of  the  AUC  performance  metric  in  a  wide  range  of 
two-class  classification  tasks,  while  analytical  investigation  of  three-class  “near-guessing”  observers  supports  our 
claim  that  the  proposed  performance  metric  is  well-defined  and  positive  in  the  limit  as  the  observer’s  performance 
approaches  that  of  the  guessing  observer. 

Keywords:  ROC  methodology,  expected  utility,  three-class  classification 

1.  INTRODUCTION 

We are attempting to extend the well-known observer performance evaluation methodology of receiver operating characteristic (ROC) analysis [1, 2] to classification tasks with three or more classes. This could conceivably be of benefit, for example, in a medical decision-making task in which a region of a patient image must be characterized as containing a malignant lesion, a benign lesion, or only normal tissue [3].

Unfortunately, a fully general but tractable extension of ROC analysis to tasks with more than two classes has yet to be developed. It is known that the performance of an observer in a classification task with $N$ classes ($N > 2$) can be completely described by a set of $N^2 - N$ conditional error probabilities [4, 5], and that the performance of the ideal observer (that which minimizes Bayes risk [4]) is completely characterized by an ROC hypersurface in which these conditional error probabilities depend on a set of $N^2 - N - 1$ decision criteria [5]. Although analytic expressions for the ideal observer’s conditional error probabilities given reasonable models for the underlying observational data have been worked out in the two-class case [6], this has not yet been accomplished in a fully general manner for tasks with three or more classes.

Furthermore, we have shown that an obvious generalization of the area under the ROC curve (AUC) does not in fact yield a useful performance metric in tasks with three or more classes [7]. In the formulation we advocate, the set of $N^2 - N$ conditional error probabilities serve as the axes of the observer’s ROC space. This is equivalent to plotting a two-class observer’s false-negative fraction (FNF), rather than the more conventional true-positive fraction (TPF), as a function of false-positive fraction (FPF) to construct the observer’s ROC curve. Since FNF = 1 - TPF, this yields an ROC curve which is simply an “upside-down” version of the conventional curve, and the area under this ROC curve (which we will denote $A$) is just one minus the conventionally defined AUC. Clearly this area will vary from 0.5, for a “guessing” observer, to 0, for a “perfect” observer. In a task with more than two classes, however, we showed that although the “hypervolume under the ROC hypersurface” (HUH) is again 0 for a perfect observer, the HUH of a guessing observer is, counterintuitively, also 0 [7]. (Briefly, the number of degrees of freedom of the guessing observer’s ROC hypersurface is $N - 1$ rather than $N^2 - N - 1$, yielding a “degenerate” hypersurface with no hypervolume, much as in three dimensions the integral under a “surface” which is actually a curve, e.g., $z = f(x, y)$ where $y = g(x)$, will be zero.)
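
The two-class statement that the “upside-down” area $A$ is 0.5 for a guessing observer and 0 for a perfect observer can be illustrated with a small Python sketch. The helper `area_under_curve` and the sampled operating points are assumptions made for illustration; the area is approximated by the trapezoidal rule on FNF plotted against FPF.

```python
def area_under_curve(fpf, fnf):
    """Trapezoidal area under FNF plotted against FPF.

    fpf, fnf : equal-length lists of operating points, sorted by FPF
    """
    area = 0.0
    for k in range(1, len(fpf)):
        area += 0.5 * (fnf[k] + fnf[k - 1]) * (fpf[k] - fpf[k - 1])
    return area


# Guessing observer: TPF equals FPF, so FNF = 1 - FPF.
guess_fpf = [0.0, 0.25, 0.5, 0.75, 1.0]
guess_fnf = [1.0 - f for f in guess_fpf]

# Perfect observer: FNF drops to 0 while FPF is still 0.
perfect_fpf = [0.0, 0.0, 1.0]
perfect_fnf = [1.0, 0.0, 0.0]

if __name__ == "__main__":
    print("A (guessing):", area_under_curve(guess_fpf, guess_fnf))
    print("A (perfect): ", area_under_curve(perfect_fpf, perfect_fnf))
```

This covers only the two-class case; the degeneracy described above for three or more classes concerns the dimensionality of the guessing observer’s hypersurface and is not reproduced by this sketch.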
What is needed is a performance metric that shares the useful properties of AUC, namely its intuitive direct relationship to the “difficulty” of the observer’s task (“near-guessing” observers have an $A$ near 0.5, “near-perfect” observers have an $A$ near 0), without suffering from this drawback of degeneracy. We have begun to investigate a performance metric that has its origins in the “expected utility” concept fundamental to ideal observer decision theory [4], and which we have reason to believe is both related to HUH and yet not plagued by the degeneracy issues of the HUH. In the next section, we attempt to motivate this performance metric, the “surface-averaged expected cost” (SAEC), and derive theoretical properties of this quantity. In Sec. 3, we outline the simulation studies we implemented in a number of simple two-class classification tasks; the results of those studies are presented in Sec. 4. The implications and limitations of the proposed metric are discussed in Sec. 5, and we summarize our conclusions in Sec. 6.


2.  THEORY 

In a two-class classification task, with the classes labeled “$\pi_+$” (“positive”) and “$\pi_-$” (“negative”), the expected utility of an observer can be written as [4]

$$E\{U\} = (U_{\mathrm{TP}}\,\mathrm{TPF} + U_{\mathrm{FN}}\,\mathrm{FNF})P(\pi_+) + (U_{\mathrm{FP}}\,\mathrm{FPF} + U_{\mathrm{TN}}\,\mathrm{TNF})P(\pi_-), \tag{1}$$

where TPF is the probability of deciding an observation is positive, conditional on it actually being drawn from class $\pi_+$, more explicitly denoted as $P(\mathbf{d} = \pi_+|\mathbf{t} = \pi_+)$; FNF is $P(\mathbf{d} = \pi_-|\mathbf{t} = \pi_+)$; FPF is $P(\mathbf{d} = \pi_+|\mathbf{t} = \pi_-)$; and TNF is the true-negative fraction, or $P(\mathbf{d} = \pi_-|\mathbf{t} = \pi_-)$. Each $U$ represents the utility of a particular decision under a particular truth condition. (We use a bold typeface to denote statistically variable quantities, and here $\mathbf{t}$ denotes the true class to which a randomly sampled observation belongs, while $\mathbf{d}$ denotes the decision made for that observation.)
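
Equation (1) can be evaluated directly, as in the following minimal Python sketch. The function and parameter names are hypothetical, and the utility values in the usage example are arbitrary illustrative choices rather than values from the text.

```python
def expected_utility_two_class(tpf, fpf, prior_pos, u_tp, u_fn, u_fp, u_tn):
    """Expected utility of a two-class observer, following (1).

    tpf, fpf  : true-positive and false-positive fractions
    prior_pos : prior probability of the positive class
    u_*       : utilities of the four decision outcomes
    """
    fnf = 1.0 - tpf  # false-negative fraction
    tnf = 1.0 - fpf  # true-negative fraction
    prior_neg = 1.0 - prior_pos
    return ((u_tp * tpf + u_fn * fnf) * prior_pos
            + (u_fp * fpf + u_tn * tnf) * prior_neg)


if __name__ == "__main__":
    # Unit utility for correct decisions, zero for errors (illustrative).
    print(expected_utility_two_class(1.0, 0.0, 0.5, 1.0, 0.0, 0.0, 1.0))
    print(expected_utility_two_class(0.5, 0.5, 0.5, 1.0, 0.0, 0.0, 1.0))
```

With these particular utilities the expected utility reduces to the overall probability of a correct decision.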

In a classification task with an arbitrary number of classes $N$, with labels running from $\pi_1$ to $\pi_N$, the above expression is readily generalized to obtain

$$E\{U\} = \sum_{j=1}^{N}\sum_{i=1}^{N} U_{i|j}P_{ij}P(\pi_j), \tag{2}$$

where we have written the observer’s conditional classification rates $P(\mathbf{d} = \pi_i|\mathbf{t} = \pi_j)$ simply as $P_{ij}$. From the rules for conditional probability [8], $\sum_{i=1}^{N} P_{ij} = 1$, and so we can rewrite this expression to obtain

$$\begin{aligned}
E\{U\} &= \sum_{i=1}^{N} U_{i|i}P(\pi_i) - \sum_{j=1}^{N}\sum_{i \ne j} (U_{j|j} - U_{i|j})P(\pi_j)P_{ij} \\
&= U_0 - \sum_{j=1}^{N}\sum_{i \ne j} \gamma_{jij}P_{ij},
\end{aligned} \tag{3}$$

where $U_0$ is just the expression $\sum_i U_{i|i}P(\pi_i)$ (independent of the conditional error rates $P_{ij}$ which describe the observer’s performance), and $\gamma_{jij} = (U_{j|j} - U_{i|j})P(\pi_j)$ gives, to within an arbitrary scale factor, the set of $N^2 - N - 1$ decision criteria used by the ideal observer to make decisions [5, 9-11]. Note that the $\gamma_{jij}$ are strictly positive if we impose the reasonable assumption that an incorrect decision will always have a smaller utility than the corresponding correct decision. If we now define the “normalized” utility (more precisely, if we choose particular units in which to “measure” utility) as
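
$$\mathbf{u} \equiv \frac{\mathbf{U}}{\left(\sum_{j}\sum_{i \ne j} \gamma_{ij}^2\right)^{1/2}}, \tag{4}$$

and similarly define $\gamma_0 \equiv U_0 / \left(\sum_{j}\sum_{i \ne j} \gamma_{ij}^2\right)^{1/2}$, we can simplify the expression for expected utility further to obtain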

$$E\{\mathbf{u}\} = \gamma_0 - \vec{\gamma} \cdot \vec{P}. \tag{5}$$

Here $\vec{P}$ is an $(N^2 - N)$-dimensional vector whose components are the conditional error rates $P_{ij}$ (with a specified ordering, e.g., $(P_{12}, P_{13}, \ldots, P_{1N}, P_{21}, \ldots, P_{N(N-2)}, P_{N(N-1)})$), i.e., the coordinates of ROC space; and $\vec{\gamma}$ is a unit vector of the same dimensionality as $\vec{P}$, whose components are the corresponding values of $\gamma_{ij}$ after normalization.
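The decomposition of the normalized expected utility into a constant minus the scalar product of a unit cost vector with the error-rate vector can be checked numerically. The sketch below is illustrative only: the helper names, the three-class utility matrix, the priors, and the classification rates are all assumed values, and the normalization is taken to be division by the Euclidean norm of the $\gamma_{ij}$.

```python
import math


def cost_vector(utilities, priors):
    """Return (gamma0, gamma, norm) for utilities[i][j] = U_{i|j}, priors[j] = P(pi_j)."""
    n = len(priors)
    # Unnormalized gamma_ij = (U_{j|j} - U_{i|j}) P(pi_j) for every i != j.
    raw = {(i, j): (utilities[j][j] - utilities[i][j]) * priors[j]
           for i in range(n) for j in range(n) if i != j}
    norm = math.sqrt(sum(g * g for g in raw.values()))
    u0 = sum(utilities[i][i] * priors[i] for i in range(n))
    gamma = {ij: g / norm for ij, g in raw.items()}
    return u0 / norm, gamma, norm


def expected_cost(gamma, rates):
    """gamma . P, with rates[i][j] = P_ij = P(d = pi_i | t = pi_j)."""
    return sum(g * rates[i][j] for (i, j), g in gamma.items())


def expected_utility(utilities, priors, rates):
    """Double sum of U_{i|j} P(pi_j) P_ij over all decisions i and truths j."""
    n = len(priors)
    return sum(utilities[i][j] * priors[j] * rates[i][j]
               for i in range(n) for j in range(n))


# Hypothetical three-class example; each column of `rates` sums to one.
utilities = [[1.0, 0.2, 0.0],
             [0.3, 1.0, 0.1],
             [0.0, 0.4, 1.0]]
priors = [0.5, 0.3, 0.2]
rates = [[0.7, 0.2, 0.1],
         [0.2, 0.6, 0.2],
         [0.1, 0.2, 0.7]]

gamma0, gamma, norm = cost_vector(utilities, priors)
lhs = expected_utility(utilities, priors, rates) / norm   # normalized expected utility
rhs = gamma0 - expected_cost(gamma, rates)                # gamma_0 - gamma . P
```

The two quantities `lhs` and `rhs` should agree to rounding error, and every component of `gamma` is positive here because each assumed correct-decision utility exceeds the corresponding incorrect ones.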

It  is  important  to  keep  in  mind  that  although  this  normalized  expected  utility  is  optimized  only  by  the 
ideal  observer,  it  is  well-defined  for  any  observer  at  a  particular  operating  point  P  and  choice  of  (normalized) 
utilities via $\vec{\gamma}$. Furthermore, assuming the values of the observational priors $P(\pi_i)$ to be fixed and the values
of the utilities to be determined externally to the observer (i.e., not modifiable by the observer within a given
experiment or set of experiments), maximizing the normalized expected utility is clearly equivalent to minimizing
$\vec{\gamma} \cdot \vec{P}$. We will refer to this latter quantity as the expected cost; note that although "cost" has a far more general
definition  in  the  literature  (as  do  “utility,”  “risk,”  etc.),  we  will  attempt  to  avoid  confusion  here  by  using  the 
term  only  in  this  restricted  sense. 

Suppose we have measured the set of all possible values of $P_{N(N-1)}$ for a given observer as a function of the other $N^2 - N - 1$ conditional error probabilities. (For the ideal observer, this can be conceived of as measuring $\vec{P}$ for every possible value of $\vec{\gamma}$; for a non-ideal observer, we assume that we can modify whatever set of $N^2 - N - 1$ decision criteria it is actually using, even if these are not usefully related to the utilities.) We write this as

$$P_{N(N-1)} = R(P_{12}, P_{13}, \ldots, P_{1N}, P_{21}, \ldots, P_{N(N-3)}, P_{N(N-2)}) = R(\vec{P}^*), \tag{6}$$

where $\vec{P}^*$ denotes the "reduced" vector, of dimensionality $N^2 - N - 1$, obtained by deleting the $(N^2 - N)$th component of $\vec{P}$. The HUH can be defined7 as


$$\mathrm{HUH} \equiv \int_{\Omega_R} R(\vec{P}^*)\, d^{N^2-N-1}P^*, \tag{7}$$

or equivalently,

$$\mathrm{HUH} = \int_{V_R} d^{N^2-N}P, \tag{8}$$

where $\Omega_R$ denotes the set of $\vec{P}^*$ for which $R(\vec{P}^*)$ is defined (the domain of the function defining the ROC hypersurface), and $V_R$ denotes the set of all $\vec{P}$ enclosed by that hypersurface and by the boundaries of the ROC space (given that $0 \le P_{ij} \le 1$). Note that in a two-class task, with the ROC curve given by $\mathrm{FNF} = R(\mathrm{FPF})$, this reduces to

$$\mathrm{HUH} = \int_0^1 R(\mathrm{FPF})\, d\mathrm{FPF} = \int_0^1 \mathrm{FNF}\; d\mathrm{FPF}, \tag{9}$$

as expected. (Note that, as stated in Sec. 1, this is one minus the conventional AUC that would be obtained by
integrating TPF as a function of FPF.)


Despite  the  long-standing  success  of  AUC  as  a  summary  performance  metric  for  ROC  analysis,  we  have  shown 
the  HUH  not  to  be  useful  for  this  purpose  in  a  classification  task  with  three  or  more  classes.7  Briefly,  a  “perfect” 
observer can achieve values of, say, $P_{N(N-1)} = 0$ for any achievable set of $\vec{P}^*$; by Eq. 7, the HUH for such an
observer  will  thus  be  zero  (and  will  approach  zero  for  a  “near-perfect”  observer).  A  “guessing”  observer  will 
assign  observations  to  the  N  classes  randomly,  independent  of  the  actual  truth  states  of  those  observations;  since 
the  total  probability  of  making  a  decision  will  be  one,  this  leaves  a  set  of  only  N  —  1  degrees  of  freedom  (each  of 
the  probabilities  of  assigning  an  observation  to  a  given  class).  But  it  can  be  shown  that  in  such  a  situation,  the 
resulting domain of integration $\Omega_R$ is "degenerate," and the integral in Eq. 7 is zero regardless of the value of
the  integrand  (and  will  approach  zero  for  a  “near-guessing”  observer).  Thus,  opposite  extremes  of  performance 
result  in  similar  or  identical  values  of  HUH,  making  this  quantity  useless  even  as  a  summary  performance  metric 
in classification tasks with more than two classes. In the two-class case, $N^2 - N - 1 = N - 1 = 1$, of course, and
(amusingly  or  providentially,  depending  perhaps  on  one’s  worldview)  no  such  degeneracy  is  encountered. 
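The dimension-counting argument for the guessing observer can be illustrated numerically. The sketch below samples random guessing observers, builds their error-rate vectors, and estimates the dimension of the set they span. It is a hypothetical illustration (the function names, the random sampling, and the rank tolerance are assumptions), not the analysis referred to in the text.

```python
import random


def guessing_operating_point(q):
    """Error-rate vector for an observer that decides class i with probability q[i]
    regardless of the truth, so that P_ij = q[i] for every j != i."""
    n = len(q)
    return [q[i] for i in range(n) for j in range(n) if j != i]


def matrix_rank(rows, tol=1e-9):
    """Rank by Gaussian elimination with partial pivoting."""
    m = [list(r) for r in rows]
    rank = 0
    for col in range(len(m[0])):
        if rank == len(m):
            break
        pivot = max(range(rank, len(m)), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            continue  # no usable pivot in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [x - factor * y for x, y in zip(m[r], m[rank])]
        rank += 1
    return rank


def guessing_dimension(n_classes, n_points=50, seed=0):
    """Dimension spanned by the operating points of randomly chosen guessing observers."""
    rng = random.Random(seed)
    points = []
    for _ in range(n_points):
        w = [rng.random() for _ in range(n_classes)]
        total = sum(w)
        points.append(guessing_operating_point([x / total for x in w]))
    base = points[0]
    diffs = [[x - y for x, y in zip(p, base)] for p in points[1:]]
    return matrix_rank(diffs)
```

For three classes, `guessing_dimension(3)` should return 2, short of the $N^2 - N - 1 = 5$ dimensions needed for a non-degenerate domain of integration, whereas for two classes both counts equal 1.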

Discouraging  though  this  result  may  be,  it  immediately  brings  to  the  forefront  the  question  of  what  motivated 
the  choice  of  AUC  as  a  summary  performance  metric  to  begin  with.  In  the  present  context,  it  can  be  said  that 
AUC  averages  directly  over  “performance  description  variables”  (such  as  FNF)  without  regard  to  utility  (or, 
equivalently,  cost).  For  an  experiment  involving  a  human  observer  (the  internals  of  whose  decision-making  process 
may  be  unavailable  to  experimenter  control)  or  an  algorithmic  observer  (trained  on  a  finite  sample  of  observational 
data),  the  actual  “costs”  may  be  unknown  to  the  experimenter,  or  may  not  be  available  for  modification  in  any 
practical  sense.  On  the  other  hand,  ideal  observer  decision  theory  demonstrates  the  tremendous  theoretical  and 
practical importance of Eq. 5, and it is natural to ask whether consideration of the expected cost, $\vec{\gamma} \cdot \vec{P}$, might
not  be  worthwhile,  given  the  difficulty  in  generalizing  AUC  just  described. 

For the ideal observer itself, this line of inquiry seems quite promising indeed. For each possible value of $\vec{\gamma}$, the ideal observer will choose an operating point $\vec{P}$ that minimizes the expected cost. (It is possible, given particular forms of the observational data probability density functions (PDFs), that multiple operating points $\vec{P}$ will be associated with a given $\vec{\gamma}$; it can be shown, however, that such points will always lie in a simply connected region, analogous to a straight line along a two-class ROC surface. We will not consider such special cases here.) By taking the ideal observer's ROC hypersurface as given, one can proceed in the opposite direction: at any given point on the ideal observer's ROC surface, the appropriate $\vec{\gamma}$ for that point is that which minimizes the expected cost. This, in turn, can be shown to imply that the appropriate $\vec{\gamma}$ is normal to the ideal observer's ROC hypersurface at each point $\vec{P}$.
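The link between minimizing the expected cost and the normal to the ROC curve can be seen in a small two-class example. The sketch below assumes a hypothetical convex curve $\mathrm{FNF} = (1 - \mathrm{FPF})^2$ standing in for an ideal observer's ROC curve, picks an arbitrary unit cost vector, finds the minimum-cost operating point by grid search, and then computes the unit normal to the curve at that point.

```python
import math


def roc_fnf(fpf):
    """Hypothetical convex ROC curve, FNF = R(FPF) = (1 - FPF)^2."""
    return (1.0 - fpf) ** 2


def roc_slope(fpf):
    """dR/dFPF for the curve above."""
    return -2.0 * (1.0 - fpf)


def min_cost_operating_point(gamma, n_grid=100001):
    """Grid search for the FPF that minimizes gamma . P, with P = (FPF, FNF)."""
    best_fpf, best_cost = 0.0, float("inf")
    for k in range(n_grid):
        fpf = k / (n_grid - 1)
        cost = gamma[0] * fpf + gamma[1] * roc_fnf(fpf)
        if cost < best_cost:
            best_fpf, best_cost = fpf, cost
    return best_fpf


gamma = (0.6, 0.8)   # assumed unit vector of normalized costs
fpf_star = min_cost_operating_point(gamma)

# Unit normal to the curve at the minimizing point, oriented so that both
# components are positive: (-R', 1) / sqrt(R'^2 + 1).
slope = roc_slope(fpf_star)
length = math.hypot(slope, 1.0)
normal = (-slope / length, 1.0 / length)
```

For this curve the minimum can also be found by hand: setting the derivative of $\gamma_1\,\mathrm{FPF} + \gamma_2 (1 - \mathrm{FPF})^2$ to zero gives $\mathrm{FPF} = 1 - \gamma_1 / (2\gamma_2)$, at which point $(-R', 1)$ is proportional to $(\gamma_1, \gamma_2)$.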

For  non-ideal  observers,  the  situation  is  much  more  confusing.  Given  that  such  an  observer  might  not  be 
basing its decisions on the utilities (available to the ideal observer) at all, it is unclear what value of $\vec{\gamma}$ to assign
to a given $\vec{P}$ on such an observer's ROC surface. Arbitrarily, we choose to make the same assignment made by
the ideal observer: at each point on the observer's ROC hypersurface, we choose that value of $\vec{\gamma}$ that is normal
to  the  ROC  hypersurface  at  that  point.  Intuitively,  this  can  be  taken  to  be  giving  the  non-ideal  observer  the 
“benefit  of  the  doubt”:  in  determining  a  total  expected  cost  for  the  observer,  we  will  at  each  point  take  the 
contribution  to  that  cost  to  be  the  “minimum”  possible.  Alternatively,  we  can  say  that  the  observer  under  this 
model  is  at  least  behaving  “locally”  optimally. 

Thus, for the ROC hypersurface given in Eq. 6, we define the "local" utility vector to be

$$\vec{\gamma}_R \equiv \frac{(-\nabla R,\ 1)}{\sqrt{|\nabla R|^2 + 1}}, \tag{10}$$

where the expression in parentheses denotes a vector of dimension $N^2 - N$ whose first $N^2 - N - 1$ components are the negatives of the components of $\nabla R$; the sign is chosen because the components of $\vec{\gamma}_R$ must be positive, ruling out the possibility $(\nabla R, -1)$. We use this definition to construct the surface integral

$$\int_{\mathrm{ROC}} \vec{\gamma}_R \cdot \vec{P}\; d^{N^2-N-1}\sigma. \tag{11}$$

The integral is over the ROC hypersurface, that is, the set of points $\vec{P}$ such that $P_{N(N-1)} = R(\vec{P}^*)$. The differential element on this hypersurface is denoted by $d^{N^2-N-1}\sigma$, where the superscript reminds us of the dimensionality "within" that surface.

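In the two-class case, the differential element reduces to the differential arc length, which we can define as

$$d\sigma \equiv \sqrt{1 + \left(\frac{d\,\mathrm{FNF}}{d\,\mathrm{FPF}}\right)^2}\; d\mathrm{FPF}. \tag{12}$$

The integral in Eq. 11 can then be written as

$$
\begin{aligned}
\int_{\mathrm{ROC}} \vec{\gamma}_R \cdot \vec{P}\; d\sigma
&= \int_0^1 \frac{-R'\,\mathrm{FPF} + \mathrm{FNF}}{\sqrt{R'^2 + 1}}\, \sqrt{1 + R'^2}\; d\mathrm{FPF} \\
&= -\int_{\mathrm{FPF}=0}^{\mathrm{FPF}=1} \mathrm{FPF}\; d\mathrm{FNF} + \int_0^1 \mathrm{FNF}\; d\mathrm{FPF} \\
&= \int_0^1 \mathrm{FPF}\; d\mathrm{FNF} + \int_0^1 \mathrm{FNF}\; d\mathrm{FPF} \\
&= 2A,
\end{aligned} \tag{13}
$$

where $R' \equiv d\,\mathrm{FNF}/d\,\mathrm{FPF}$. Note that in the next to last step, the negative sign has disappeared because FNF = 0 when FPF = 1 and vice versa, so that the order of the limits of integration will be reversed. It is also vital to remember that $A$ here denotes the area under the "upside-down" ROC curve (FNF plotted against FPF), and is thus one minus the conventional AUC.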
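The two-class result, that the surface integral of the locally normal cost equals twice the area under the FNF-versus-FPF curve, can be checked numerically on a sampled curve. The sketch below is a minimal illustration under stated assumptions: the curve $\mathrm{FNF} = (1 - \mathrm{FPF})^2$ is hypothetical, the function name is invented for this example, and the curve is approximated by straight segments.

```python
def surface_integral_and_area(points):
    """points: operating points (FPF, FNF) ordered by increasing FPF.

    Returns (integral of gamma_R . P along the polyline,
             trapezoidal area under FNF plotted against FPF).
    """
    integral = 0.0
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dx, dy = x1 - x0, y1 - y0
        xm, ym = 0.5 * (x0 + x1), 0.5 * (y0 + y1)
        # On a segment the unit normal with non-negative components is
        # (-dy, dx) / length and the arc-length element is `length`,
        # so the two factors of `length` cancel in gamma_R . P d(sigma).
        integral += -dy * xm + dx * ym
        area += dx * ym
    return integral, area


n = 10000
# Hypothetical curve FNF = (1 - FPF)^2, sampled at n + 1 points.
curve = [(k / n, (1.0 - k / n) ** 2) for k in range(n + 1)]
integral, area = surface_integral_and_area(curve)
```

Because the cancellation of the segment lengths does not depend on the shape of the curve, the same function can be applied to any sampled two-class ROC curve, ideal or not.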

Clearly the quantity we have defined is directly related to performance, and in fact far more closely than we had reason to hope: despite our ad hoc choice of $\vec{\gamma}_R$, the relation in Eq. 13 holds for arbitrary observers, and not just ideal observers. Even more surprisingly, the generalization of this relationship can be shown to hold for observers in tasks with arbitrary numbers of classes. Returning to Eq. 11, we rearrange terms to obtain

$$
\begin{aligned}
\int_{\mathrm{ROC}} \vec{\gamma}_R \cdot \vec{P}\; d^{N^2-N-1}\sigma
&= \oint_{\partial V_R} \vec{P} \cdot \hat{n}\; d^{N^2-N-1}\sigma \\
&= \int_{V_R} \mathrm{div}\, \vec{P}\; d^{N^2-N}P \\
&= (N^2 - N)\, \mathrm{HUH},
\end{aligned} \tag{14}
$$

where $\hat{n}$ is the outward unit normal to the boundary (equal to $\vec{\gamma}_R$ on the ROC hypersurface). Here we have used the n-dimensional extension of the divergence theorem (known in three dimensions as Gauss's theorem);12 div is the operator $\sum_i \partial / \partial P_i$, which when applied to the vector $\vec{P}$ will simply yield the dimensionality $N^2 - N$ of $\vec{P}$. Note also that in the first step, we have "closed" the ROC hypersurface with the boundary $\partial V_R$ of the ROC hypervolume; this can be done for the given integrand, because the "bottom" surface $P_{N(N-1)} = 0$ will contribute nothing to the surface integral.

Unfortunately, we are now back where we started: since it is equal (to within a proportionality constant) to the HUH, the surface integral defined above will have exactly the same drawbacks as that quantity. However, writing the performance metric in this form (as an integral of the scalar quantity $\vec{\gamma}_R \cdot \vec{P}$ over the ROC hypersurface) suggests a different approach, namely, considering an "average" of this quantity over the hypersurface:

$$\bar{C}_\sigma \equiv \frac{\int_{\mathrm{ROC}} \vec{\gamma}_R \cdot \vec{P}\; d^{N^2-N-1}\sigma}{\int_{\mathrm{ROC}} d^{N^2-N-1}\sigma}, \tag{15}$$

where we have divided the previous quantity by the "surface area" of the ROC hypersurface. The quantity $\bar{C}_\sigma$ is the SAEC referred to in Sec. 1; the overline reminds us that it is an expectation value, and the subscript $\sigma$ reminds us that it is averaged over a surface (the ROC hypersurface). This can be considered analogous to the concept from univariable calculus of the "average" of a function over an interval:


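$$f_{\mathrm{avg}} = \frac{1}{b - a} \int_a^b f(x)\; dx. \tag{16}$$

In particular, it should be immediately clear that $\bar{C}_\sigma$ is bounded by the maximum and minimum values of $\vec{\gamma}_R \cdot \vec{P}$, and that if $\vec{\gamma}_R \cdot \vec{P}$ were constant over a given ROC hypersurface, then $\bar{C}_\sigma$ would be equal to this constant value.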

Further  analysis  will  need  to  be  performed  to  confirm  that  this  quantity  remains  well-defined  for  guessing 
or  even  “near-guessing”  observers.  We  have  reason  to  believe  that  an  extension  of  L’Hopital’s  rule  should  be 
applicable  in  this  case;  i.e.,  although  the  numerator  and  denominator  will  both  converge  to  zero  in  the  limit  of 
approach to a guessing observer, the limit of $\bar{C}_\sigma$ itself should still be a non-zero quantity. Our results in this
regard, however, are still very preliminary. For the present work, we will consider only properties of this quantity
in the two-class case, where the degeneracy issues involving HUH do not arise. In the two-class case, of course,
we can use Eq. 13 to write

$$\bar{C}_\sigma = \frac{2A}{S} = \frac{2\,(1 - \mathrm{AUC})}{S}, \tag{17}$$

where $S$ is the arc length along the ROC curve.


3.  MATERIALS  AND  METHOD 

We numerically investigated the behavior of $\bar{C}_\sigma$ compared with the conventional AUC under two models for the distributions of the observer's latent decision variable data: the "conventional" binormal model,13 and the ideal-observer-related "proper" binormal model.6 Under the conventional model, the observer's decision variables are assumed to be drawn from a pair of distributions which are an (unspecified) monotonic transformation of two normal distributions:

$$x_+ \sim N(x;\ \mu_+ = a/b,\ \sigma_+ = 1/b) \tag{18}$$

and

$$x_- \sim N(x;\ \mu_- = 0,\ \sigma_- = 1), \tag{19}$$

where $N(x; \mu, \sigma)$ is a normal density function with mean $\mu$ and standard deviation $\sigma$. The observer makes decisions by comparing an observation of unknown class $x$ with a threshold $x_0$; varying this threshold from $-\infty$ to $\infty$ will sweep out the observer's ROC curve. This curve is completely specified by the two parameters $a$ and $b$, and analytic forms exist for both individual operating points (FPF, TPF) and the conventional AUC (denoted $A_z$ under this model) as functions of $a$ and $b$.13

Under the "proper" binormal model, the observer is again assumed to make decisions using underlying data monotonically related to the pair of distributions given in Eqs. 18 and 19. However, the actual decisions are made by comparing the likelihood ratio of $x$, rather than $x$ itself, with a threshold. The likelihood ratio is given by

$$y = \frac{N(x;\ a/b,\ 1/b)}{N(x;\ 0,\ 1)}. \tag{20}$$

Figure 1. Isopleths of the $A_z$ performance metric (solid lines) and of the proposed $\bar{C}_\sigma$ metric (dash-dotted lines), for various values of the $a$ and $b$ parameters of the conventional binormal model.


Varying the threshold $y_0$ throughout its range will sweep out the observer's ROC curve. For numerical purposes, it has been found convenient to parametrize this curve using the parameters

$$c \equiv \frac{b - 1}{b + 1}, \qquad d_a \equiv \frac{\sqrt{2}\, a}{\sqrt{1 + b^2}}, \tag{21}$$

rather than $a$ and $b$ directly. The observer's ROC curve is completely specified by $c$ and $d_a$, and analytic forms have been determined for both individual operating points (FPF, TPF) and the conventional AUC under this model as functions of those two parameters.6
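To see why thresholding the likelihood ratio differs from thresholding $x$ itself, it helps to evaluate the ratio of the two normal densities of Eqs. 18 and 19 directly. The sketch below is illustrative (the function name is an assumption); it uses the closed form $\ln y = \ln b - (bx - a)^2/2 + x^2/2$, which follows from writing out the two densities, and it requires $b > 0$.

```python
import math


def log_likelihood_ratio(x, a, b):
    """ln y for y = N(x; a/b, 1/b) / N(x; 0, 1), valid for b > 0."""
    return math.log(b) - 0.5 * (b * x - a) ** 2 + 0.5 * x ** 2


a, b = 0.4819, 0.5060            # parameter pair taken from Sec. 4
# ln y is quadratic in x, so for b != 1 it has a single stationary point:
x_min = -a * b / (1.0 - b * b)
```

Because $\ln y$ is quadratic in $x$ whenever $b \ne 1$, it is not a monotonic function of $x$: for $b < 1$ it rises on both sides of `x_min`, so a single threshold on $y$ corresponds to two thresholds on $x$.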

We calculated the $A_z$ of an observer assumed to operate under the conventional binormal model for 250 values of $a$ distributed uniformly between 0 and 5, and (at each such value of $a$) for 250 values of $b$ distributed uniformly between 0 and 2. For each of these 62,500 pairs of parameter values, we also calculated the corresponding value of $\bar{C}_\sigma$ using the relation in Eq. 17. (The arc length $S$ was calculated by generating a large number of operating points along the curve, and adding together the line segment lengths $\sqrt{(\mathrm{FPF}_i - \mathrm{FPF}_{i-1})^2 + (\mathrm{TPF}_i - \mathrm{TPF}_{i-1})^2}$.)
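The calculation for a single parameter pair under the conventional binormal model can be sketched as follows. This is a minimal illustration, not the code used for the study: the function names, the threshold range, and the number of operating points are assumptions, and the SAEC is taken to be $2(1 - A_z)/S$. The sketch relies on the standard binormal results $\mathrm{FPF} = \Phi(-x_0)$, $\mathrm{TPF} = \Phi(a - b x_0)$, and $A_z = \Phi(a/\sqrt{1 + b^2})$, where $\Phi$ is the standard normal cumulative distribution function.

```python
import math


def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def binormal_az(a, b):
    """Area under the conventional binormal ROC curve."""
    return phi(a / math.sqrt(1.0 + b * b))


def binormal_arc_length(a, b, n_points=20001, x_max=10.0):
    """Polyline approximation to the arc length S of the (FPF, TPF) curve."""
    points = [(0.0, 0.0)]
    for k in range(n_points):
        x0 = x_max - 2.0 * x_max * k / (n_points - 1)  # threshold, swept downward
        points.append((phi(-x0), phi(a - b * x0)))
    points.append((1.0, 1.0))
    return sum(math.hypot(q[0] - p[0], q[1] - p[1])
               for p, q in zip(points, points[1:]))


def binormal_saec(a, b):
    """Surface-averaged expected cost, taken here to be 2 (1 - A_z) / S."""
    return 2.0 * (1.0 - binormal_az(a, b)) / binormal_arc_length(a, b)


# The two parameter pairs discussed in Sec. 4.
az_1, saec_1 = binormal_az(0.4819, 0.5060), binormal_saec(0.4819, 0.5060)
az_2, saec_2 = binormal_az(0.2410, 0.0080), binormal_saec(0.2410, 0.0080)
```

Appending the corners (0, 0) and (1, 1) to the list of operating points lets the polyline capture the nearly vertical ends of strongly "hooked" curves, where FPF has already saturated but TPF has not. The values obtained should come out close to those reported in Sec. 4, with small differences expected from the rounding of the parameter values and the choice of discretization.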

A  similar  procedure  was  performed  for  the  proper  binormal  model.  We  calculated  the  conventional  AUC  for 
each of 250 values of $c$ distributed uniformly between $-1$ and 1, and (at each such value of $c$) for 250 values of $d_a$
uniformly distributed between 0 and 4. For each of these 62,500 pairs of parameter values, we also calculated the
corresponding value of $\bar{C}_\sigma$ (again using the approximation for arc length described for the conventional model).


4.  RESULTS 

The calculated values of $A_z$ and of $\bar{C}_\sigma$ for the conventional binormal model are shown in isopleth ("contour")
plots in Fig. 1. Similarly, the calculated values of the conventional AUC and $\bar{C}_\sigma$ for the proper binormal model
are shown in isopleth plots in Fig. 2.

Although difficult to discern from the plot, the isopleths in Fig. 1 do in fact cross, particularly in the lower left region. For example, the parameter pair ($a$ = 0.4819, $b$ = 0.5060) corresponds to an $A_z$ value of 0.6663 and a $\bar{C}_\sigma$ of 0.4315, while the parameter pair ($a$ = 0.2410, $b$ = 0.0080) corresponds to an ROC curve which has both a lower $A_z$ of 0.5950 and a lower $\bar{C}_\sigma$ of 0.4095. (Recall that, as its name implies, the SAEC $\bar{C}_\sigma$ is a "cost", and thus lower values are intended to be "preferable," in contrast to $A_z$ and the conventional AUC.) These two ROC curves are plotted (conventionally, using TPF as the ordinate) in Fig. 3.

Figure 2. Isopleths of the conventional AUC performance metric (solid lines) and of the proposed $\bar{C}_\sigma$ metric (dash-dotted lines), for various values of the $c$ and $d_a$ parameters of the proper binormal model.

Figure 3. ROC curves generated under the conventional binormal model with parameter values of ($a$ = 0.4819, $b$ = 0.5060) (solid curve), and ($a$ = 0.2410, $b$ = 0.0080) (dash-dotted curve).


5.  DISCUSSION 


It is evident from Fig. 1 that the proposed performance metric $\bar{C}_\sigma$ does not perform identically to the conventional AUC in all situations (i.e., for arbitrary decision rules). This is illustrated in more detail in Fig. 3; if the two curves represented observers (radiologists or imaging systems, for example) which one wished to rank in order of performance, then the two performance metrics would disagree as to which were actually preferable. This is understandable given the shapes of the curves; the system with slightly lower $A_z$ is so severely "hooked" that its arc length will be very close to two, driving down the "cost" $\bar{C}_\sigma$ to a greater extent than the loss in conventional AUC.

It  should  be  recalled,  however,  that  in  practical  situations  in  which  such  a  severe  “hook”  is  seen  in  the  ROC 
curve,  the  observational  data  themselves  do  not  usually  support  such  a  fitting  of  the  curve.6  Even  aside  from 
such  data  sampling  and  curve-fitting  issues,  comparing  two  systems  when  at  least  one  of  them  has  an  ROC  curve 
with  such  a  large  “hook”  is  often  problematic  (compare  the  well-known  situation  when  two  systems  have  very 
similar  AUCs,  but  “cross,”  making  the  decision  of  which  system  to  prefer  dependent  on  the  region  of  ROC  space 
in which one chooses to operate). In short, the fact that $\bar{C}_\sigma$ does not agree exactly with a performance metric
such as $A_z$, itself known to be imperfect, is not necessarily a fatal flaw.

The  results  presented  in  Fig.  2  are  far  more  surprising.  There  appear  to  be  no  visible  “crossings”  of  the 
isopleths for any choices of parameters $c$ and $d_a$. Although this result still needs to be confirmed analytically, it
would, if found true, imply that $\bar{C}_\sigma$ and the conventional AUC under the proper binormal model are equivalent
performance  metrics.  Whether  this  equivalence  could  be  extended  to  arbitrary  ideal  observer  models  (i.e.,  those 
for  arbitrary  PDFs  rather  than  the  binormal  model)  would  also  be  an  important  area  for  further  investigation. 

The  extensibility  of  the  proposed  performance  metric  to  tasks  with  more  than  two  classes  is  quite  plausible, 
but  much  remains  to  be  done  here  as  well.  Preliminary  work  in  this  direction  suggests  that  it  may  be  possible 
to  apply  an  extension  of  L’Hopital’s  rule  to  the  integrals  in  Eq.  15  in  the  situation  where  they  approach  0  due  to 
dimensionality  considerations.  However,  the  resulting  limit  appears  to  depend  strongly  on  the  underlying  data 
PDFs  (a  counterintuitive  result  given  the  behavior  of  two-class  near-guessing  observers,  whose  ROC  curves  all 
approach  the  diagonal  line  regardless  of  the  data  PDFs).  More  careful  work  will  be  necessary  to  validate  or 
refute  these  claims. 

Related  to  the  issue  of  dimensionality  just  mentioned  is  the  situation  of  the  “discrete”  observer,  i.e.,  an 
observer  which  operates  only  at  discrete  operating  points  in  ROC  space  (this  applies  to  the  two-class  observer  as 
well as those with more classes). We have so far been unable to usefully generalize the definition of $\vec{\gamma}_R$ and thus
Eq.  15  to  this  situation,  even  in  the  two-class  case.  It  remains  to  be  seen  whether  this  last  issue  is  an  important 
one  or  not. 


6.  CONCLUSIONS 

We  have  proposed  a  novel  ROC  performance  metric,  the  SAEC.  Although  grounded  in  the  same  theoretical 
framework  as  the  expected  utility  of  the  ideal  observer,  its  practical  realization  involves  readily  comprehensible 
quantities  —  the  AUC  and  the  arc  length  along  the  ROC  curve  in  a  two-class  task,  and  the  surface-averaged 
integral  of  a  well-defined  scalar  in  a  task  with  more  than  two  classes. 

Although  the  properties  of  this  performance  metric  have  yet  to  be  thoroughly  investigated,  preliminary  results 
are  quite  encouraging.  We  have  high  hopes  that  this  performance  metric  will  allow  comparison  of  observers  in 
classification  tasks  of  varying  complexity,  without  suffering  from  the  drawbacks  that  other  performance  metrics, 
such  as  the  HUH,  have  been  shown  to  possess. 